Abstract: We consider the problem of communicating over a
channel for which no mathematical model is specified. We present 
achievable rates as a function of the channel input and output 
known a-posteriori for discrete and continuous channels, as well 
as a rate-adaptive scheme employing feedback which achieves 
these rates asymptotically without prior knowledge of the channel 
behavior. 



I. Introduction 

The problem of communicating over a channel with an
individual, predetermined noise sequence which is not known
to the sender and receiver was addressed by Shayevitz and
Feder [1], [2] and Eswaran et al. [3]. The simple example
discussed in [1] is of a binary channel $y_n = x_n \oplus e_n$ where
the error sequence $e_n$ can be any unknown sequence. Using
perfect feedback and common randomness, communication is
shown to be possible at a rate approaching the capacity of the
binary symmetric channel (BSC) where the error probability
equals the empirical error probability of the sequence (the
relative number of 1-s in $e_n$). Subsequently both authors
extended this model to general discrete channels and
modulo-additive channels ([2], [3]) with an individual state
sequence, and showed that the empirical mutual information
can be attained.

Now we take this model one step further. We consider 
a channel where no specific probabilistic or mathematical 
relation between the input and the output is assumed. In order 
to define positive communication rates without assumptions 
on the channel, we characterize the achievable rate using the 
specific input and output sequences, and we term this channel 
an individual channel. This way of treating unknown
channels is different from other concepts of dealing with the
problem, such as compound channels and arbitrarily varying
channels, in the fact that the latter require a specification of
the channel model up to some unknown parameters, whereas 
the current approach makes no a-priori assumptions about 
the channel behavior. We usually assume the existence of a 
feedback link in which the channel output or other information 
from the decoder can be sent back to the encoder. Without 
this feedback it would not be possible to match the rate of 
transmission to the quality of the channel so outage would be 
inevitable. 

Although one may not be fully comfortable with the mathematical
formulation of the problem, there is no question about
the reality of this model: this is the only channel model that 
we know for sure exists in nature. This point of view is similar 
to the approach used in universal source coding of individual 
sequences where the goal is to asymptotically attain for each
sequence the same coding rate achieved by the best encoder
from a model class, tuned to the sequence. 

Just to inspire thought, let's ask the following question:
suppose the sequence $\{x_i\}_{i=1}^n$ with power $P = \frac{1}{n}\sum_{i=1}^n x_i^2$
encodes a message and is transmitted over a continuous
real-valued input channel. The output sequence is $\{y_i\}_{i=1}^n$. One
can think of $v_i = y_i - x_i$ as a noise sequence and measure its
power $N = \frac{1}{n}\sum_{i=1}^n v_i^2$. Is the rate $R = \frac{1}{2}\log\left(1 + \frac{P}{N}\right)$, which
is the Gaussian channel capacity, achievable in this case, under
appropriate definitions?

The way it was posed, the answer to this question would
be "no", since this model predicts a rate of $\frac{1}{2}$ bit/use for the
channel whose output is $\forall i: y_i = 0$, which cannot convey any
information. However, with the slight restatement done in the
next section the answer would be "yes".

We consider two classes of individual channels: discrete 
input and output channels and continuous real valued input 
and output channels, and two communication models: with 
feedback and without feedback. In both cases we assume 
common randomness exists. The case of feedback is of higher 
interest, since the encoder can adapt the transmission rate 
and avoid outage. The case of no-feedback is used as an 
intermediate step, but the results are interesting since they 
can be used for analysis of semi probabilistic models. The 
main result is that with a small amount of feedback, a com- 
munication at a rate close to the empirical mutual information 
(or its Gaussian equivalent for continuous channels) can be 
achieved, without any prior knowledge, or assumptions, about 
the channel structure. 

The paper is organized as follows: in Section II we give a
high level overview of the results. In Section III we define
the model and notation. Section IV deals with communication
without feedback, where the results pertaining to the discrete and
continuous cases are formalized and proven, and the choice
of the rate function and the Gaussian prior for the continuous
case is justified. Section V deals with the case where feedback
is present. After reviewing similar results we state the main
result and the adaptive rate scheme that achieves it, and defer
the proof to Section VI. There, the error probability and the
achieved rate are analyzed and bounded. Section VII gives
several examples, and Section VIII is dedicated to comments
and highlights areas for further study.

II. Overview of main results 

We start with a high level overview of the definitions 
and results. The definitions below are conceptual rather than 
accurate, and detailed definitions follow in the next sections. 



A rate function $R_{\mathrm{emp}} : \mathcal{X}^n \times \mathcal{Y}^n \to \mathbb{R}$ is a function of
the input and output sequences. In communication without
feedback we say a given rate function is achievable if for
large block size $n \to \infty$, it is possible to communicate at
rate $R$ and an arbitrarily small error probability is obtained
whenever $R_{\mathrm{emp}}$ exceeds the rate of transmission, i.e. whenever
$R_{\mathrm{emp}}(\mathbf{x},\mathbf{y}) > R$. In communication with feedback we say a
given rate function is achieved by a communication scheme
if for large block size $n$, data at a rate close to or exceeding
$R_{\mathrm{emp}}(\mathbf{x},\mathbf{y})$ is decoded successfully with arbitrarily large
probability for every output sequence and almost every input
sequence. Roughly speaking, this means that in any instance
of the system operation, where a specific $\mathbf{x}$ was the input and
a specific $\mathbf{y}$ was the output, the communication rate had been
at least $R_{\mathrm{emp}}(\mathbf{x},\mathbf{y})$. Note that the only statistical assumptions
are related to the common randomness, and we consider the
rate and error probability conditioned on a specific input and
output, where the error probability is averaged over common
randomness. We say that a rate function $R_{\mathrm{emp}}$ is an optimal
(but not the optimal) function if any $R'_{\mathrm{emp}} \ge R_{\mathrm{emp}}$ which is
strictly larger than $R_{\mathrm{emp}}$ at at least one point is not achievable.

The definition of achievability is not complete without 
stating the input distribution, since it affects the empirical 
rate. For example, by setting $\mathbf{x} = 0$ one can attain every rate
function where $R_{\mathrm{emp}}(0,\mathbf{y}) = 0$ in a void way, since other $\mathbf{x}$
sequences will never appear. Different from classical results in 
information theory, we do not use the input distribution only as 
a means to show the existence of good codes: taking advantage 
of the common randomness we require the encoder to emit 
input symbols that are random and distributed according to a 
defined prior (currently we assume i.i.d. distribution). 

The choice of the rate functions is arbitrary in a way: for 
any pair of encoder and decoder, we can tailor a function 
$R_{\mathrm{emp}}(\mathbf{x},\mathbf{y})$ as a function equaling the transmitted rate whenever
the error probability given the two sequences (averaged
over messages and the common randomness) is sufficiently
small, and 0 otherwise. However it is clear that there are
certain rates which cannot be exceeded uniformly. Our interest 
will focus on simple functions of the input and output, 
and specifically in this paper we focus on functions of the 
instantaneous (zero order) empirical statistics. Extension to 
higher order models seems technical. 

For the discrete channel we show that a rate

$$R_{\mathrm{emp}} = \hat{I}(\mathbf{x};\mathbf{y}) \qquad (1)$$

is achievable with any input distribution $P_X$, where $\hat{I}(\cdot;\cdot)$
denotes the empirical mutual information [5] (see the definition
in Section III and Theorems 1 and 3). For the continuous (real
valued) channel we show that a rate

$$R_{\mathrm{emp}} = \frac{1}{2}\log\left(\frac{1}{1-\hat{\rho}(\mathbf{x},\mathbf{y})^2}\right) \qquad (2)$$

is achievable with Gaussian input distribution $\mathcal{N}(0,P)$, where
$\hat{\rho}$ is the empirical correlation factor between the input and
output sequences (see Theorems 2 and 4). These results pertain
both to the case of feedback and of no-feedback according to
the definitions above.

Throughout the current paper we define the correlation factor
in a slightly non-standard way as $\rho = \frac{E(XY)}{\sqrt{E(X^2)E(Y^2)}}$ (that is,
without subtracting the mean). This is done only to simplify
definitions and derivations, and similar claims can be made using
the correlation factor defined in the standard way. Although
the result regarding the continuous case is less tight, we show
that this is the best rate function that can be defined by second
order moments, and is tight for the Gaussian additive channel
(for this channel $\rho^2 = \frac{P}{P+N}$, therefore $R_{\mathrm{emp}} = \frac{1}{2}\log\left(1 + \frac{P}{N}\right)$).

We may now rephrase our example question from the introduction
so that it will have an affirmative answer: given the
input and output sequences, describe the output by the virtual
additive channel with a gain $y_i = \alpha x_i + v_i$, so the effective
noise sequence is $v_i = y_i - \alpha x_i$. Choose $\alpha$ so that $\mathbf{v} \perp \mathbf{x}$, i.e.
$\frac{1}{n}\sum_i v_i x_i = 0$. An equivalent condition is that $\alpha$ minimizes
$\|\mathbf{v}\|^2$. The resulting $\alpha$ is the LMMSE coefficient in estimation
of $\mathbf{y}$ from $\mathbf{x}$ (assuming zero mean), i.e. $\alpha = \frac{\mathbf{x}^T\mathbf{y}}{\|\mathbf{x}\|^2}$. Define the
effective noise power as $N = \frac{1}{n}\sum_{i=1}^n v_i^2$, and the effective
$\mathrm{SNR} = \frac{\alpha^2 P}{N}$. It is easy to check that $\mathrm{SNR} = \frac{\hat{\rho}^2}{1-\hat{\rho}^2}$, where
$\hat{\rho} = \frac{\mathbf{x}^T\mathbf{y}}{\|\mathbf{x}\| \cdot \|\mathbf{y}\|}$ is the empirical correlation factor between $\mathbf{x}$ and
$\mathbf{y}$. Then according to Eq. (2) the rate $R = \frac{1}{2}\log(1 + \mathrm{SNR})$
is achievable, in the sense defined above. Reexamining the
counter example we gave above, in this model if we set $\mathbf{y} = 0$
we obtain $\hat{\rho} = 0$ and therefore $R_{\mathrm{emp}} = 0$, or equivalently the
effective channel has $\mathbf{v} = 0$ and $\alpha = 0$, therefore $\mathrm{SNR} = 0$
(instead of $\mathbf{v} = -\mathbf{x}$, $\alpha = 1$ and $\mathrm{SNR} = 1$).
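The virtual-channel construction above can be checked numerically. The following is a minimal sketch, not part of the original derivation: the function name `effective_channel` and the convention of returning zero correlation and zero SNR for an all-zero output are illustrative assumptions. It computes the LMMSE gain, the effective SNR, the empirical correlation factor (without mean subtraction) and the rate of Eq. (2) in bits.

```python
import math

def effective_channel(x, y):
    """Return (alpha, snr, rho, rate_bits) for the virtual channel y = alpha*x + v.

    Hypothetical helper for illustration; assumes x is not all zeros.
    """
    n = len(x)
    xx = sum(a * a for a in x)
    yy = sum(b * b for b in y)
    xy = sum(a * b for a, b in zip(x, y))
    alpha = xy / xx                          # LMMSE gain, makes v orthogonal to x
    v = [b - alpha * a for a, b in zip(x, y)]
    power = xx / n                           # empirical input power P
    noise = sum(c * c for c in v) / n        # effective noise power N
    # Assumed convention: an all-zero output has zero correlation with the input.
    rho = xy / math.sqrt(xx * yy) if yy > 0 else 0.0
    if noise > 0:
        snr = alpha ** 2 * power / noise
    else:
        snr = math.inf if alpha != 0 else 0.0
    rate_bits = 0.5 * math.log2(1 + snr)
    return alpha, snr, rho, rate_bits
```

For the all-zero output of the counter example this returns a gain, SNR and rate of zero, while for a general pair of sequences the returned SNR agrees with $\hat{\rho}^2/(1-\hat{\rho}^2)$ up to rounding.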

As will be seen, we achieve these rates by random coding 
and universal decoders. For the case of feedback we use 
iterated instances of rateless coding (i.e. we encode a fixed 
number of bits and the decision time depends on the channel). 
The scheme is able to operate asymptotically with "zero rate" 
feedback (meaning any positive capacity of the feedback chan- 
nel suffices). A similar although more complicated scheme was 
used in [3] (see a comparison in the appendix).

Before the detailed presentation we would like to examine 
the differences between the model used here and two prox- 
imate models: the arbitrarily varying channel (AVC) and the 
channel with individual noise sequence. 

In the AVC (see for example [6], [7]), the channel is defined
by a probabilistic model which includes an unknown state 
sequence. Constraints on the sequence (such as power, number 
of errors) may be defined, and the target is to communi- 
cate equally well over all possible occurrences of the state 
sequence. In AVC, the capacity depends on the existence of 
common randomness and on whether the average or maximum 
error probability (over the messages) is required to approach 0, 
yet when sufficient common randomness is used, the capacities 
for maximum and average error probability are equal. The 
notes in [6] regarding common randomness and randomized
encoders (see p. 2151) are also relevant to our case. 

A treatment of AVC-s which is similar in spirit to our results 
exists in watermarking problems. For example a rather general 
case of AVC is discussed in [8]. They consider communication
over a black box (representing the attacker) which is only 
limited to a given level D of distortion according to a 
predefined metric, but has otherwise a block-wise undefined 
behavior. They show that it is possible to achieve a rate equal 
to the rate-distortion function of the input Rx(D), if the 
black box guarantees a given level of average distortion in 
high probability. This result is similar to our Theorem 1. The
remarkable distinction from other results for AVC is that the
rate is determined using a constraint on the channel inputs
and outputs, rather than the channel state sequence. We note
that for the Gaussian additive channel the above result is
suboptimal since the rate is $R_X(N) = \frac{1}{2}\log(P/N)$, and our
results improve on it by using the correlation factor
rather than the mean squared error. See further discussion
of these results in the proof of Lemma 1 and the discussion
following Theorem 3.

Channels with individual noise (or state) sequence are 
treated by Shayevitz and Feder [1], [2] and Eswaran et al. [3].
The probabilistic setting is the same as in the AVC, and 
the difference is that instead of achieving a uniform (hence 
worst-case) rate, the target is to achieve a variable rate which 
depends on the particular sequence of noise, using a feedback 
link. In this setup, prior constraints on the state sequence can 
be relaxed. As opposed to AVC where the capacity is well 
defined, the target rate for each state sequence is determined 
in a somewhat arbitrary way (since many different constraints 
on the sequence can be defined). As an example, in the binary 
channel of [1], a rate of 0 would be obtained for the sequence
$\mathbf{e} = $ '01010101...' since the empirical error probability is
$\frac{1}{2}$, although obviously a scheme which favors this specific
sequence and achieves a rate of 1 can be designed. On the 
other hand, with the AVC approach communication over this 
channel would not be possible without prior constraints on the 
noise sequence. Channels with individual noise sequence can 
be thought of as compound-AVCs (i.e. an AVC with unknown 
parameter, in this case, the constraint). As in AVC, existence 
of common randomness as well as the definition of error 
probability affect the achievable rates. 
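The alternating-sequence example above can be reproduced with a few lines. This is only an illustrative sketch: the helper names `binary_entropy_bits` and `empirical_bsc_rate` are hypothetical, and the target rate is taken to be the BSC capacity evaluated at the empirical error probability, as described in the introduction.

```python
import math

def binary_entropy_bits(p):
    """Binary entropy function in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def empirical_bsc_rate(e):
    """BSC capacity (bits/use) at the empirical error probability of the sequence e."""
    p_hat = sum(e) / len(e)          # relative number of 1-s in e
    return 1.0 - binary_entropy_bits(p_hat)

alternating = [0, 1] * 50            # e = '0101...'
rate_alternating = empirical_bsc_rate(alternating)
rate_clean = empirical_bsc_rate([0] * 100)
```

The target rate depends on the error sequence only through its fraction of ones, which is why the highly structured alternating sequence is assigned the same rate as a typical sequence with error probability one half.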

In the individual channel model we use here, since no 
equation with state sequence connecting the input and output is 
given, the achievable rates cannot be defined without relating 
to the channel input. Therefore the definitions of achieved rates 
depend in a somewhat circular way on the channel input which 
is determined by the scheme itself. Currently we circumvent 
this difficulty by constraining the input distribution, as men- 
tioned above. 

In many aspects the model used in this paper is more 
stringent than the AVC and the individual noise sequence 
models, since it makes fewer assumptions on the channel, and
the error probability is required to be met for (almost) every 
input and output sequence (rather than on average). In other 
aspects it is lenient since we may attribute 'bad' channel 
behavior to the rate rather than suffer an error, therefore the 
error exponents are better than in probabilistic models. This 
is further explained in Section IV-A.

The model we propose suggests a new approach for the 
design of communication systems. The classical point of 
view first assumes a channel model and then devises a 
communication system optimized for it. Here we take the 
inverse direction: we devise a communication system without 
assumptions on the channel which guarantees rates depending 
on channel behavior. This change of viewpoint does not make 
probabilistic or semi probabilistic channel models redundant 
but merely suggests an alternative. By using a channel model
we can formalize questions relating to optimality such as
capacity (single user, networks) and error exponent as well 
as guarantee a communication rate a-priori. Another aspect 
is that we pay a price for universality. Even if one considers 
an individual channel scheme that guarantees asymptotically 
optimum rates over a large class of channels, it can never 
consider all possible channels (block-wise), and for a finite 
block size it will have a larger overhead (a reduction in 
the amount of information communicated with same error 
probability) compared to a scheme optimized for the specific 
channel. 

Following our results, the individual channel approach be- 
comes a very natural starting point for determining achievable 
rates for various probabilistic and arbitrary models (AVC-s, 
individual noise sequences, probabilistic models, compound 
channels) under the realm of randomized encoders, since 
the achievable rates for these models follow easily from the 
achievable rates for specific sequences, and the law of large 
numbers. We will give some examples later on. 

III. Definitions and notation 

A. Notation 

In general we use uppercase letters to denote random 
variables, respective lowercase letters to denote their sample 
values and boldface letters to denote vectors, which are by 
default of length n. However we deviate from this practice 
when the change of case leads to confusion, and vectors 
are always denoted by lowercase letters even when they are 
random variables. 

$\|\mathbf{x}\| = \sqrt{\mathbf{x}^T\mathbf{x}}$ denotes the $L_2$ norm. We denote by $P \circ Q$
the product of conditional probability functions, e.g.
$(P \circ Q)(x,y) = P(x) \cdot Q(y|x)$. A hat ($\hat{\square}$) denotes an estimated
value.

We denote the empirical distribution as $\hat{P}$ (e.g.
$\hat{P}_{(\mathbf{x},\mathbf{y})}(x,y) = \frac{1}{n}\sum_{i=1}^n \delta_{(x_i - x),(y_i - y)}$). The source vectors
$\mathbf{x}, \mathbf{y}$ and/or the variables $x, y$ are sometimes omitted when
they are clear from the context. We denote by $\hat{H}(\cdot)$, $\hat{I}(\cdot;\cdot)$,
$\hat{\rho}(\cdot;\cdot)$ the empirical entropy, the empirical mutual information
and the empirical correlation factor, which are the respective
values calculated for the empirical distribution. All expressions
such as $\hat{H}(\mathbf{x})$, $\hat{H}(\mathbf{x}|\mathbf{y})$, $\hat{I}(\mathbf{x};\mathbf{y})$, $\hat{I}(\mathbf{x};\mathbf{y}|\mathbf{z})$, $\hat{I}(\mathbf{x};\mathbf{y}|\mathbf{z} = z_0)$
are interpreted as their respective probabilistic counterparts
$H(X)$, $H(X|Y)$, $I(X;Y)$, $I(X;Y|Z)$, $I(X;Y|Z = z_0)$
where $(X, Y, Z)$ are random variables distributed according
to the empirical distribution of the vectors $\hat{P}_{(\mathbf{x},\mathbf{y},\mathbf{z})}$, or
equivalently are defined as a random selection of an element
of the vectors, i.e. $(X, Y, Z) = (x_i, y_i, z_i)$, $i \sim U\{1,\ldots,n\}$.
It is clear from this equivalence that relations on entropy and 
mutual information (e.g. positivity, chain rules) are directly 
translated to relations on their empirical counterparts. 
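As a concrete reading of this notation, the sketch below computes the empirical mutual information of two equal-length sequences directly from their joint type. The function name `empirical_mutual_information` and the choice of natural logarithms (nats) are assumptions made for the example; the paper leaves the base unspecified.

```python
import math
from collections import Counter

def empirical_mutual_information(x, y):
    """Mutual information (nats) of (X, Y) drawn uniformly from the pairs (x_i, y_i)."""
    n = len(x)
    joint = Counter(zip(x, y))       # n times the empirical joint distribution
    count_x = Counter(x)             # n times the empirical marginal of x
    count_y = Counter(y)             # n times the empirical marginal of y
    # Sum of p(a,b) * log( p(a,b) / (p(a) p(b)) ), written with raw counts.
    return sum((c / n) * math.log(c * n / (count_x[a] * count_y[b]))
               for (a, b), c in joint.items())
```

Identical binary sequences with balanced symbols give $\log 2$, sequences whose joint type factorizes give zero, and a single symbol pair gives zero, in line with the remark made later for sequences of length one.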

We apply superscript and subscript indices to vectors
to define subsequences in the standard way, i.e.
$\mathbf{x}_i^j = (x_i, x_{i+1}, \ldots, x_j)$ and $\mathbf{x}^i = \mathbf{x}_1^i$.

We denote by $I(P, W)$ the mutual information $I(X;Y)$ when
$(X, Y) \sim P(x) \cdot W(y|x)$. $U(A)$ denotes a uniform distribution
over the set $A$. $\mathrm{Ber}(p)$ denotes the Bernoulli distribution, and
$h_b(p) = H(\mathrm{Ber}(p)) = -p\log p - (1-p)\log(1-p)$ denotes
the binary entropy function. The indicator function $\mathrm{Ind}(E)$,
where $E$ is a set or a probabilistic event, is defined as 1 over
the set (or when the event occurs) and 0 otherwise.

The functions $\log(\cdot)$ and $\exp(\cdot)$ as well as the information
theoretic quantities $H(\cdot)$, $I(\cdot;\cdot)$, $D(\cdot\|\cdot)$ refer to the same,
unspecified base. We use the term "information unit" as the
unit of these quantities (one information unit equals $\log_2 b$ bits, where $b$ is the base).

The notation $f_n = O(g_n)$ and $f_n < O(g_n)$ (or equivalently
$O(f_n) = O(g_n)$ and $O(f_n) < O(g_n)$) means $\frac{f_n}{g_n} \to \mathrm{const} > 0$
and $\frac{f_n}{g_n} \to 0$ respectively.



Throughout this paper we use the term "continuous" to refer 
to the continuous real valued channel $\mathbb{R} \to \mathbb{R}$, although this
definition does not cover all continuous input - continuous 
output channels. By the term "discrete" in this paper we always 
refer to finite alphabets (as opposed to countable ones). 

B. Definitions 

Definition 1 (Channel). A channel is defined by a pair of
input and output alphabets $\mathcal{X}, \mathcal{Y}$, and denoted $\mathcal{X} \to \mathcal{Y}$.

Definition 2 (Fixed rate encoder, decoder, error probability).
A randomized block encoder and decoder pair for the channel
$\mathcal{X} \to \mathcal{Y}$ with block length $n$ and rate $R$ without feedback is
defined by a random variable $S$ distributed over the set $\mathcal{S}$, a
mapping $\phi : \{1, 2, \ldots, \exp(nR)\} \times \mathcal{S} \to \mathcal{X}^n$ and a mapping
$\bar{\phi} : \mathcal{Y}^n \times \mathcal{S} \to \{1, 2, \ldots, \exp(nR)\}$. The error probability for
message $w \in \{1, 2, \ldots, \exp(nR)\}$ is defined as

$$P_e^{(w)}(\mathbf{x},\mathbf{y}) = \Pr\left(\bar{\phi}(\mathbf{y}, S) \ne w \mid \phi(w, S) = \mathbf{x}\right) \qquad (3)$$

where for $\mathbf{x}$ such that the condition cannot hold, we define
$P_e^{(w)}(\mathbf{x},\mathbf{y}) = 0$.

Note that the encoder rate must pertain to a discrete number
of messages $\exp(nR) \in \mathbb{Z}_+$, but the empirical rates defined
in the following theorems may be any positive real numbers.

Definition 3 (Adaptive rate encoder, decoder, error probability).
A randomized block encoder and decoder pair for
the channel $\mathcal{X} \to \mathcal{Y}$ with block length $n$, adaptive rate and
feedback is defined as follows:

• The message $w$ is expressed by the infinite sequence
$w_1^\infty \in \{0,1\}^\infty$.

• The common randomness is defined as a random variable
$S$ distributed over the set $\mathcal{S}$.

• The feedback alphabet is denoted $\mathcal{F}$.

• The encoder is defined by a series of mappings
$x_k = \phi_k(w, s, f^{k-1})$ where $\phi_k : \{0,1\}^\infty \times \mathcal{S} \times \mathcal{F}^{k-1} \to \mathcal{X}$.

• The decoder is defined by the feedback function
$\varphi_k : \mathcal{Y}^k \times \mathcal{S} \to \mathcal{F}$, the decoding function
$\bar{\phi} : \mathcal{Y}^n \times \mathcal{S} \to \{0,1\}^\infty$ and the rate function
$r : \mathcal{Y}^n \times \mathcal{S} \to \mathbb{R}_+$ (where the rate is measured in bits), applied as follows:

$$f_k = \varphi_k(\mathbf{y}^k, S) \qquad (4)$$
$$\hat{w} = \bar{\phi}(\mathbf{y}, S) \qquad (5)$$
$$R = r(\mathbf{y}, S) \qquad (6)$$

The error probability for message $w$ is defined as

$$P_e^{(w)}(\mathbf{x},\mathbf{y}) = \Pr\left(\hat{w}_1^{\lceil nR \rceil} \ne w_1^{\lceil nR \rceil} \,\middle|\, \mathbf{x}, \mathbf{y}\right) \qquad (7)$$

In other words, a recovery of the first $\lceil nR \rceil$ bits by the
decoder is considered a successful reception. For $\mathbf{x}$ such that
the condition cannot hold, we define $P_e^{(w)}(\mathbf{x},\mathbf{y}) = 0$. The
conditioning on $\mathbf{y}$ is mainly for clarification, since it can be
treated as a fixed vector. This system is illustrated in Figure 2.

Note that if we are not interested in limiting the feedback
rate, and perfect feedback can be assumed, the definition of
feedback alphabet and feedback function is redundant (in this
case $\mathcal{F} = \mathcal{Y}$ and $f_k = y_k$). The model in which the decoder
determines the transmission rate is lenient in the sense that
it gives the flexibility to exchange rate for error probability:
the decoder may estimate the error probability and decrease
it by reducing the decoding rate. In the scheme we discuss
here the rate is determined during reception, but it is worth
noting in this context the posterior matching scheme [9] for
the known memoryless channel. In this scheme the message is
represented as a real number $\theta \in [0,1)$ and the rate for a given
error probability $P_e$ can be determined after the decoding
by calculating $\Pr(\theta|\mathbf{y})$ and finding the smallest interval with
probability at least $1 - P_e$.

IV. Communication without feedback 

In this section we show that the empirical mutual informa- 
tion (in the discrete case) and its Gaussian counterpart (in the 
continuous case) are achievable in the sense defined in the 
overview. For the continuous case we justify the choice of the 
Gaussian distribution as the one yielding the maximum rate 
function that can be defined by second order moments. 

A. The discrete channel without feedback 

The following theorem formalizes the achievability of rate
$\hat{I}(\mathbf{x};\mathbf{y})$ without feedback:

Theorem 1 (Non-adaptive, discrete channel). Given discrete
input and output alphabets $\mathcal{X}, \mathcal{Y}$, for every $P_e > 0$, $\delta > 0$,
prior $Q(x)$ over $\mathcal{X}$ and rate $R > 0$ there exists $n$ large enough
and a random encoder-decoder pair of rate $R$ over block size
$n$, such that the distribution of the input sequence is $\mathbf{x} \sim Q^n$
and the probability of error for any message given an input
sequence $\mathbf{x} \in \mathcal{X}^n$ and output sequence $\mathbf{y} \in \mathcal{Y}^n$ is not greater
than $P_e$ if $\hat{I}(\mathbf{x};\mathbf{y}) > R + \delta$.

Theorem 1 follows almost immediately from the following
lemma, which is proven in the appendix using a simple
calculation based on the method of types [10]:

Lemma 1. For any sequence $\mathbf{y} \in \mathcal{Y}^n$ the probability of a
sequence $\mathbf{x} \in \mathcal{X}^n$ drawn independently according to $Q^n$ to
have $\hat{I}(\mathbf{x};\mathbf{y}) \ge t$ is upper bounded by:

$$Q^n\left(\hat{I}(\mathbf{x};\mathbf{y}) \ge t\right) \le \exp\left(-n(t - \delta_n)\right) \qquad (8)$$

where $\delta_n = |\mathcal{X}||\mathcal{Y}|\frac{\log(n+1)}{n} \to 0$.

Fig. 2. Rate adaptive encoder-decoder pair with feedback

Following the notation in [10], $Q^n(A)$ denotes the probability
of the event $A$, or equivalently of the set of sequences $A$, under the
i.i.d. distribution $Q^n$. Remarkably, this bound does not depend
on $Q$.
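The bound of Lemma 1 can be sanity-checked by simulation. The sketch below is an illustration under stated assumptions rather than part of the proof: the helper names (`empirical_mi`, `lemma1_bound`, `tail_frequency`), the binary alphabets, the block length and the threshold are all arbitrary choices, and logarithms are natural. It draws $\mathbf{x} \sim Q^n$ repeatedly for a fixed $\mathbf{y}$ and compares the observed frequency of the event $\hat{I}(\mathbf{x};\mathbf{y}) \ge t$ with $\exp(-n(t - \delta_n))$ for two different priors.

```python
import math
import random
from collections import Counter

def empirical_mi(x, y):
    """Empirical mutual information in nats, computed from the joint type."""
    n = len(x)
    joint, cx, cy = Counter(zip(x, y)), Counter(x), Counter(y)
    return sum((c / n) * math.log(c * n / (cx[a] * cy[b]))
               for (a, b), c in joint.items())

def lemma1_bound(n, t, size_x, size_y):
    """Right-hand side of Eq. (8) with delta_n = |X||Y| log(n+1) / n."""
    delta_n = size_x * size_y * math.log(n + 1) / n
    return math.exp(-n * (t - delta_n))

def tail_frequency(y, q, t, trials, rng):
    """Fraction of i.i.d. draws x ~ Q^n with empirical MI at least t."""
    symbols = list(range(len(q)))
    hits = 0
    for _ in range(trials):
        x = rng.choices(symbols, weights=q, k=len(y))
        if empirical_mi(x, y) >= t:
            hits += 1
    return hits / trials

rng = random.Random(0)
y_fixed = [i % 2 for i in range(100)]          # an arbitrary fixed output sequence
bound = lemma1_bound(100, 0.25, 2, 2)
freq_uniform = tail_frequency(y_fixed, [0.5, 0.5], 0.25, 500, rng)
freq_skewed = tail_frequency(y_fixed, [0.9, 0.1], 0.25, 500, rng)
```

The same value `bound` applies to both priors, which is the sense in which the bound does not depend on $Q$; the polynomial factor hidden in $\delta_n$ makes it loose for short blocks.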

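To prove Theorem 1, the codebook $\{\mathbf{x}_m\}_{m=1}^{\exp(nR)}$ is randomly
generated by i.i.d. selection of its $L = \exp(nR) \cdot n$
letters, so that the common randomness $S \in \mathcal{X}^L$ may be
defined as the codebook itself and is distributed $Q^L$. The
encoder sends the $w$-th codeword, and the decoder uses
maximum mutual information (MMI) decoding, i.e. chooses:

$$\hat{w} = \bar{\phi}(\mathbf{y}, \{\mathbf{x}_m\}) = \arg\max_m \hat{I}(\mathbf{x}_m; \mathbf{y}) \qquad (9)$$

where ties are broken arbitrarily. By Lemma 1 the probability
of error is bounded by:

$$P_e^{(w)}(\mathbf{x}_w, \mathbf{y}) \le \Pr\left(\bigcup_{m \ne w} \left(\hat{I}(\mathbf{x}_m; \mathbf{y}) \ge \hat{I}(\mathbf{x}_w; \mathbf{y})\right)\right) \le \exp(nR)\exp\left(-n\left(\hat{I}(\mathbf{x}_w; \mathbf{y}) - \delta_n\right)\right) = \exp\left(-n\left(\hat{I}(\mathbf{x}_w; \mathbf{y}) - R - \delta_n\right)\right) \qquad (10)$$

For any $\delta$ there is $n$ large enough such that
$-\frac{\log(P_e)}{n} + \delta_n < \delta$.
For this $n$, whenever $\hat{I}(\mathbf{x};\mathbf{y}) > R + \delta$ we have

$$P_e^{(w)}(\mathbf{x},\mathbf{y}) \le \exp\left(-n(\delta - \delta_n)\right) < P_e \qquad (11)$$

which proves the theorem. □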
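The random-coding argument above can be illustrated with a toy simulation. The sketch below is not the scheme analyzed in the paper at realistic parameters: the names `mmi_decode` and `demo`, the binary alphabet with a uniform prior, the 16-message codebook and the deterministic "flip every tenth symbol" channel are all assumptions chosen for the example. The decoder is given no model of the channel and simply picks the codeword with the largest empirical mutual information with the output.

```python
import math
import random
from collections import Counter

def empirical_mi(x, y):
    """Empirical mutual information in nats, computed from the joint type."""
    n = len(x)
    joint, cx, cy = Counter(zip(x, y)), Counter(x), Counter(y)
    return sum((c / n) * math.log(c * n / (cx[a] * cy[b]))
               for (a, b), c in joint.items())

def mmi_decode(codebook, y):
    """Index of the codeword maximizing the empirical MI with y (first one on ties)."""
    return max(range(len(codebook)), key=lambda m: empirical_mi(codebook[m], y))

def demo(seed=0, n=200, num_messages=16, w=5):
    """Send message w through an arbitrary channel and return the decoded index."""
    rng = random.Random(seed)
    # The codebook plays the role of the common randomness S (uniform binary prior).
    codebook = [[rng.randrange(2) for _ in range(n)] for _ in range(num_messages)]
    # Arbitrary channel behavior, unknown to the decoder: flip every 10th symbol.
    y = [b ^ int(i % 10 == 0) for i, b in enumerate(codebook[w])]
    return mmi_decode(codebook, y)
```

Here the transmitted rate is $\log(16)/200$ nats per use, far below the empirical mutual information between the sent codeword and the output, so decoding succeeds in this regime.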
Note that the MMI decoder used here is a popular universal
decoder (see [5], [10], [11]), and was shown to achieve the same
error exponent as the maximum likelihood decoder for fixed
composition codes. The error exponent obtained here is better
than the classical error exponent (slope of -1), and the reason
is that the behavior of the channel is known, and therefore
no errors occur as a result of non-typical channel behavior.
Comparing for example with the derivation of the random
coding error exponent for the probabilistic DMC based on the
method of types (see [10]), in the latter the error probability is
summed across all potential "behaviors" (conditional types)
of the channel accounting for their respective probabilities
(resulting in one behavior, usually different from the typical
behavior, dominating the bound), while here the behavior
of the channel (the conditional distribution) is fixed, and
therefore the error exponent is better. This is not necessarily
the best error exponent that can be achieved (see [11], [12],
which discuss error exponents with random decision time and
feedback for probabilistic and compound models).

Note that the empirical mutual information is always well 
defined, even when some of the input and output symbols do 
not appear in the sequence, since at least one input symbol and 
one output symbol always appear. For the particular case of 
empirical mutual information measured over a single symbol, 
the empirical distributions become unit vectors (representing 
constants) and their mutual information is 0. 

In this discussion we have not dealt with the issue of choos- 
ing the prior Q(x). Since the channel behavior is unknown it 
makes sense to choose the maximum entropy, i.e. the uniform, 
prior which was shown to obtain a bounded loss from capacity 

in. 

B. The continuous channel without feedback 

When turning to define empirical rates for the real valued 
alphabet case, the first obstacle we tackle is the definition 
of the empirical mutual information. A potential approach 
is to use discrete approximations. We only briefly describe 
this approach since it is somewhat arbitrary and less elegant 
than in the discrete case. The main focus is on empirical rates 
defined by the correlation factor. Although the latter approach
is pessimistic and falls short of the mutual information for
most channels, it is much simpler and more elegant than discrete
approximations. We believe this approach can be further 
extended to obtain results closer to the (probabilistic) mutual 
information. 

1) Discrete approximations: Define the continuous input
and output alphabets $\mathcal{X}, \mathcal{Y}$. Suppose $Q$ is an arbitrary (continuous)
prior. Define input and output quantizers to discrete
alphabets $A_n : \mathcal{X} \to \mathcal{X}_n$ and $B_n : \mathcal{Y} \to \mathcal{Y}_n$ where $\mathcal{X}_n$,
$\mathcal{Y}_n$ are discrete alphabets of growing size, chosen to grow
slowly enough so that $\delta_n = |\mathcal{X}_n||\mathcal{Y}_n|\frac{\log(n+1)}{n} \to 0$. Define
the empirical mutual information between continuous vectors
as the empirical mutual information between their quantized
versions (quantized letter by letter):

$$\hat{I}_{A,B}(\mathbf{x},\mathbf{y}) = \hat{I}(A_n(\mathbf{x}), B_n(\mathbf{y})) \qquad (12)$$

Then based on Lemma 1, by using a random codebook drawn
according to $Q$ and applying a maximum mutual information
decoder using the above definition, we could asymptotically
achieve the rate function $R_{\mathrm{emp}} = \hat{I}_{A,B}(\mathbf{x},\mathbf{y})$ based on the
definitions of Theorem 1. The main issue with this approach
is that determining $A_n$, $B_n$ is arbitrary, and especially $B_n$ is
difficult to define when the output range is unknown. Therefore
in the following we focus on the suboptimal approach using
the correlation factor.
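To make the quantized definition concrete, the sketch below evaluates the empirical mutual information between letter-by-letter quantized versions of two real-valued sequences. The uniform, clipping quantizer and the names `uniform_quantizer` and `quantized_empirical_mi` are hypothetical choices made only for illustration; as noted above, the paper leaves $A_n$ and $B_n$ arbitrary, and the need to guess an output range here is exactly the difficulty it points out.

```python
import math
from collections import Counter

def empirical_mi(x, y):
    """Empirical mutual information in nats, computed from the joint type."""
    n = len(x)
    joint, cx, cy = Counter(zip(x, y)), Counter(x), Counter(y)
    return sum((c / n) * math.log(c * n / (cx[a] * cy[b]))
               for (a, b), c in joint.items())

def uniform_quantizer(levels, lo, hi):
    """Hypothetical quantizer: clip to [lo, hi] and map to one of `levels` equal cells."""
    step = (hi - lo) / levels
    def quantize(v):
        clipped = min(max(v, lo), hi)
        return min(int((clipped - lo) / step), levels - 1)
    return quantize

def quantized_empirical_mi(x, y, a_n, b_n):
    """Empirical MI between the letter-by-letter quantized sequences."""
    return empirical_mi([a_n(v) for v in x], [b_n(v) for v in y])
```

The result depends on the assumed range and number of levels, so two reasonable quantizer choices can assign different rates to the same pair of sequences.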

2) Choosing the input distribution and rate function: First
we justify our choice of the Gaussian input distribution and
the aforementioned rate function. We take the point of view
of a compound (probabilistic, unknown) channel. If a rate
function cannot be attained for the compound channel model,
it cannot be attained in the more stringent individual
model either. It is well known that for a memoryless additive noise
channel with constraints on the transmit power and noise
variance, the Gaussian noise is the worst noise when the
prior is Gaussian, and the Gaussian prior is the best prior
when the noise is Gaussian. Thus by choosing a Gaussian
prior we choose the best prior for the worst noise, and we
can guarantee the mutual information will equal, at least, the
Gaussian channel capacity. See the "mutual information game"
(problem 9.21) in [14]. For the additive noise channel, [15]
shows the loss from capacity when using a Gaussian distribution
is limited to half a bit. However the above is true only for additive
noise channels. For the more general case where no additivity is
assumed we show below (Lemma 3) that the rate function
$R = -\frac{1}{2}\log(1 - \rho^2)$ is the best rate function that can be defined
by second order moments, and attained universally. Of course,
this proof merely supplies the motivation to use a Gaussian
distribution and does not relieve us of the need to prove this
rate is achievable for specific, individual sequences.

Lemma 2. Let $X, Y$ be two continuous random variables with
correlation factor $\rho = \frac{E(XY)}{\sqrt{E(X^2)E(Y^2)}}$, where $X$ is Gaussian,
$X \sim \mathcal{N}(0, P)$. Then $I(X;Y) \ge -\frac{1}{2}\log(1 - \rho^2)$.

Corollary 2.1. Equality holds iff $X, Y$ are jointly Gaussian.

Corollary 2.2. The lemma does not hold for general $X$ (not
Gaussian).

The proof is given in the appendix. Note that $-\frac{1}{2}\log(1 - \rho^2)$
is the mutual information of two Gaussian r.v-s ([14], example
8.5.1). Also note the relation to Theorem 1 in [16] dealing
with an additive channel with uncorrelated, but not necessarily
independent noise. The following lemma justifies our selection
of $R(\rho) = -\frac{1}{2}\log(1 - \rho^2)$:
Lemma 3. Let Q(x) be an input prior, W(y\x) be an 
unknown channel, A(Q, W) be the correlation matrix A = 



E r^J \y) between X, Y induced by the joint probability 
Q o W and p(Q, W) be the correlation factor induced by 
=). We say a function R(A) is an attainable 



second order rate function if there exists a Q(x) such that 
for every channel W(y\x) inducing correlation A the mutual 
information is at least R(A) (in other words can carry the rate 
R(A)j. Then R(A) = — | log(l — p 2 ) is the largest attainable 
second order rate function. 

Alternatively this can be stated as: 



R(A) = max min I(Q, W) 

Q W:A(Q,W)=A 



log(l-p 2 ) (13) 



Proof of Lemma 3: $R(\Lambda) = -\frac{1}{2}\log(1-\rho^2)$ is attainable
by selecting an input prior $Q = \mathcal{N}(0,\sigma^2)$, since by Lemma 2
the mutual information is then at least $R(\Lambda)$ for all channels.
$R(\Lambda)$ is the maximum attainable function since, by writing
the condition of the lemma for the additive white Gaussian
noise (AWGN) channel $W^*$ (a specific choice of W) and any
Q, we have $R(\Lambda) \le I(Q,W^*) \le I(\mathcal{N}(0, E_P(X^2)), W^*) = -\frac{1}{2}\log(1-\rho^2)$,
where the inequalities follow from the conditions
of the lemma on R and from the fact that the Gaussian prior
achieves the AWGN capacity.
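The last equality in the proof can be checked numerically. The following is a minimal sketch, assuming an AWGN channel $Y = X + V$ with $E(X^2) = P$, $E(V^2) = N$ and V uncorrelated with X; the function names and the values of P and N are illustrative and do not come from the paper.

```python
import math

def gaussian_rate(rho):
    """Second order rate function R(rho) = -1/2 * log(1 - rho^2), in nats."""
    return -0.5 * math.log(1.0 - rho * rho)

def awgn_correlation(P, N):
    """Correlation factor of X and Y = X + V when E[X^2] = P, E[V^2] = N
    and V is uncorrelated with X (so E[XY] = P and E[Y^2] = P + N)."""
    return P / math.sqrt(P * (P + N))

# Illustrative (assumed) power and noise values.
P, N = 4.0, 1.0
rho = awgn_correlation(P, N)
rate_from_rho = gaussian_rate(rho)
awgn_capacity = 0.5 * math.log(1.0 + P / N)
print(rate_from_rho, awgn_capacity)
```

Both printed values agree because $\rho^2 = P/(P+N)$ gives $1/(1-\rho^2) = 1 + P/N$.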

3) Communication scheme for the empirical channel (without
feedback): The following theorem is the analogue of
Theorem 1, where the expression $-\frac{1}{2}\log(1-\rho^2)$ (interpreted
as the Gaussian effective mutual information) plays the role
of mutual information.

Theorem 2 (Non-adaptive, continuous channel). Given the
channel $\mathbb{R} \to \mathbb{R}$, for every $P_e > 0$, $\delta > 0$, power $P > 0$
and rate $R > 0$ there exists n large enough and a random
encoder-decoder pair of rate R over block size n, such that
the distribution of the input sequence is $x \sim \mathcal{N}^n(0,P)$ and the
probability of error for any message given an input sequence
x and output sequence y with empirical correlation $\rho$ is not
greater than $P_e$ if $R_{emp} \equiv \frac{1}{2}\log\left(\frac{1}{1-\rho^2}\right) \ge R + \delta$.

As before, the theorem will follow easily from the following 
lemma, proven in the appendix. 

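Lemma 4. Let $x, y \in \mathbb{R}^n$ be two sequences, and $\rho = \frac{x^T y}{\|x\|\,\|y\|}$
be the empirical correlation factor. For any y, the probability
of x drawn according to $\mathcal{N}^n(0,\sigma^2)$ to have $|\rho| \ge t$ is bounded
by:

$$\Pr(|\rho| \ge t) \le 2\exp\left(-(n-1)R_2(t)\right) \qquad (14)$$

where

$$R_2(t) \equiv \frac{1}{2}\log\left(\frac{1}{1-t^2}\right) \qquad (15)$$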
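The bound of Lemma 4 can be sanity-checked by simulation. The following is a minimal Monte Carlo sketch, not part of the original proof; the output sequence y, the block length, the threshold t and the number of trials are arbitrary assumptions chosen for illustration.

```python
import math
import random

def r2(t):
    """R_2(t) = 1/2 * log(1 / (1 - t^2)), in nats (Eq. (15))."""
    return 0.5 * math.log(1.0 / (1.0 - t * t))

def correlation(x, y):
    """Empirical correlation factor x^T y / (||x|| ||y||)."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return sxy / math.sqrt(sxx * syy)

def tail_estimate(y, t, trials, rng):
    """Fraction of i.i.d. Gaussian draws of x with |rho(x, y)| >= t."""
    hits = 0
    for _ in range(trials):
        x = [rng.gauss(0.0, 1.0) for _ in range(len(y))]
        if abs(correlation(x, y)) >= t:
            hits += 1
    return hits / trials

rng = random.Random(0)
n, t = 10, 0.5
y = [float(k % 3) - 1.0 for k in range(n)]   # arbitrary fixed output sequence
estimate = tail_estimate(y, t, 5000, rng)
bound = 2.0 * math.exp(-(n - 1) * r2(t))     # right-hand side of Eq. (14)
print(estimate, bound)
```

The estimate does not depend on the particular y chosen, since the distribution of x is rotation invariant; only the direction of y enters the correlation factor.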



To prove Theorem 2, the codebook $\{x_m\}_{m=1}^{\exp(nR)}$ is randomly
generated by Gaussian i.i.d. selection of its $L = \exp(nR) \cdot n$
letters, and the common randomness $S \in \mathcal{X}^L$
is defined as the codebook itself and is distributed $\mathcal{N}^L(0,P)$.
The encoder sends the w-th codeword, and the decoder uses a
maximum empirical correlation decoder, i.e. chooses:



$$\hat{w}(y, \{x_m\}) = \arg\max_m |\rho(x_m; y)| = \arg\max_m \frac{|x_m^T y|}{\|x_m\|} \qquad (16)$$

where ties are broken arbitrarily. By Lemma 4 the probability
of error is bounded by:

$$P_e^{(w)}(x_w, y) \le \Pr\left\{\bigcup_{m \ne w}\left(|\rho(x_m;y)| \ge |\rho(x_w;y)|\right)\right\} \le \exp(nR)\cdot 2\exp\left(-(n-1)R_2(\rho(x_w;y))\right) = 2\exp(R)\cdot\exp\left(-(n-1)(R_2(\rho)-R)\right) \qquad (17)$$

Choosing n large enough so that $\frac{R + \log(2/P_e)}{n-1} \le \delta$
(where $P_e$ is from Theorem 2), we have that when $R_2(\rho) \ge R + \delta$:

$$P_e^{(w)}(x, y) \le 2\exp(R)\cdot\exp\left(-(n-1)\delta\right) \le P_e \qquad (18)$$



which proves the theorem. □
A note is due regarding the definition of $\rho$ in singular cases
where x or y are 0. The limit of $\rho$ as $y \to 0$ is undefined (the
directional derivative may take any value in [0,1]); however,
for consistency we define $\rho = 0$ when $y = 0$. Since x is
generated from a Gaussian distribution we do not worry about
the event $x = 0$, since the probability of this event is 0.
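The decoder of Eq. (16), including the convention $\rho = 0$ for an all-zero sequence, can be sketched as follows. This is a minimal illustration under assumed parameters: the codebook size, block length, message index and the test channel (a negative gain plus Gaussian noise) are all hypothetical and serve only to exercise the decoder.

```python
import math
import random

def correlation(x, y):
    """Empirical correlation factor, with rho = 0 in the singular case."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def make_codebook(num_codewords, n, power, rng):
    """Random codebook with i.i.d. N(0, power) letters (the common randomness)."""
    sigma = math.sqrt(power)
    return [[rng.gauss(0.0, sigma) for _ in range(n)]
            for _ in range(num_codewords)]

def decode(y, codebook):
    """Maximum empirical correlation decoder: argmax over m of |rho(x_m; y)|."""
    return max(range(len(codebook)),
               key=lambda m: abs(correlation(codebook[m], y)))

rng = random.Random(1)
codebook = make_codebook(16, 200, 1.0, rng)
w = 5                                           # transmitted message index (assumed)
# Hypothetical channel: unknown negative gain plus Gaussian noise.
y = [-0.7 * c + rng.gauss(0.0, 0.5) for c in codebook[w]]
w_hat = decode(y, codebook)
print(w, w_hat)
```

Because the decoder uses $|\rho|$, it is insensitive to the sign and magnitude of the channel gain, which the decoder does not know.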

It is worth spending a few words on the connections between
the receivers used for the discrete and the continuous cases.
Since the mutual information between two Gaussian r.v-s is
$-\frac{1}{2}\log(1-\rho^2)$, one can think of this value as a measure of
mutual information under Gaussian assumptions. Thus, using
this metric as an effective mutual information, and since it
is an increasing function of $|\rho|$, the MMI decoder
becomes a maximum empirical correlation decoder. On the
other hand, the receiver we used can be identified as the GLRT
(generalized likelihood ratio test) for the AWGN
channel $Y = \alpha X + \mathcal{N}(0,\sigma^2)$ with $\alpha$ an unknown parameter,
resulting from maximizing the likelihood of the codeword and
the channel simultaneously:

$$\hat{w} = \arg\max_m \max_\alpha \log\Pr(y|x_m;\alpha) = \arg\min_m \min_\alpha \|y - \alpha x_m\|^2 = \arg\max_m \frac{(x_m^T y)^2}{\|x_m\|^2} = \arg\max_m \left[\rho^2(x_m, y)\right] \qquad (19)$$
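The middle step of Eq. (19) follows from an ordinary least-squares minimization over the unknown gain. A short worked derivation, written here as a sketch with $\alpha^*$ denoting the minimizing gain (a symbol introduced only for this illustration):

```latex
\begin{align*}
\frac{\partial}{\partial\alpha}\,\|y-\alpha x_m\|^2 = 0
  &\;\Longrightarrow\; \alpha^* = \frac{x_m^T y}{\|x_m\|^2}, \\
\min_\alpha \|y-\alpha x_m\|^2
  &= \|y\|^2 - \frac{(x_m^T y)^2}{\|x_m\|^2}
   = \|y\|^2\left(1-\rho^2(x_m,y)\right).
\end{align*}
```

Since $\|y\|^2$ does not depend on m, minimizing the residual over m is the same as maximizing $\rho^2(x_m, y)$.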

The choice of the GLRT is motivated by considering the 
individual channel as an effective additive channel with un- 
known gain (as presented in section D|, combined with the 
fact that Gaussian noise is the worst. For discrete memoryless
channels it is easy to show that the GLRT (where the group of 
channels consists of all DMC-s) is synonymous with the MMI 
decoder (see O). Thus, we can identify the two decoders as 
GLRT decoders, or equivalently as variants of MMI decoders. 
In the sequel we sometimes use the term "empirical mutual
information" in a broad sense that also includes the metric
$-\frac{1}{2}\log(1-\rho^2)$.

Regarding the receiver required to obtain the rates of
Theorem 2, it is interesting to consider the simpler maximum
projection receiver $\arg\max_m |x_m^T y|$. This receiver seems to
differ from the maximum correlation receiver only in the term
$\|x_m\|$, which is nearly constant for large n due to the law
of large numbers. Surprisingly, however, the maximum rate
achievable with the projection receiver is only $\frac{1}{2}\rho^2$, as can
be shown by a simple calculation equivalent to Lemma 4
(simpler, since $z = x^T y$ is Gaussian). The reason is that
when x is chosen independently of y, a large value of the
projection (a non-typical event) is usually created by a sequence
with power significantly exceeding the average (another
non-typical event). When one non-typical event occurs there is no
reason to believe the sequence is typical in other senses, thus
the approximation $\|x_m\| \approx \sqrt{nP}$ is invalid. The correlation
receiver normalizes by the power of x and compensates for
this effect. An alternative receiver which yields the rates of
Theorem 2 and is similar to the AEP receiver looks for the
codeword with the maximum absolute projection subject to
power limited to $\frac{1}{n}\|x_m\|^2 \le P + \epsilon$. This can be shown
by Sanov theorem ifTUl or by using the Chernoff bound. 
The maximum correlation receiver was chosen because of its 
elegance and the simplicity of the proof of Lemma [4] 

Combining this lemma with the law of large numbers
provides a simple proof for the achievability of the AWGN
capacity ($\frac{1}{2}\log(1+\mathrm{SNR})$), which uses much simpler mechanics
than the popular proof based on AEP or error exponents. This
receiver has the technical advantage, compared to the AEP
receiver, that it does not declare an error for codewords which
have power deviating from the nominal power. This technical
advantage is important in the context of rateless decoding, since
the power condition would need to be re-validated at each symbol, thus
increasing its contribution to the overall error probability.

Lapidoth [17] showed that the nearest neighbor receiver
achieves a rate equal to the Gaussian capacity $\frac{1}{2}\log(1 + P/N)$
over the additive channel $Y = X + V$ with arbitrary noise
distribution (with fixed noise power). This result parallels the
result that the random code capacity of the AVC $Y = X + V$
with a power constraint on V equals the Gaussian capacity [18]
(this stems directly from the characterization of the random
code capacity of the AVC as $\max_{P_X(x)} \min_{P_S(s)} I(X;Y)$,
cf. lfTOl Eq.(V.4)). Our result is stronger since it does not 
assume the channel is additive (nor any fixed behavior), but
considering the former results it is not surprising, if one
assumes (1) that any channel can be modeled as $Y = \alpha X + V$
with $V \perp X$, (2) that the dependence of V on X does not
increase the error probability due to orthogonality (see [16])
and (3) that the loss from the single unknown parameter $\alpha$ is
asymptotically small.

Another related result is Agarwal et al.'s [8] result that it
is possible to communicate at a rate approaching the
rate-distortion function $R_X(D)$ over an arbitrarily varying channel
with unknown block-wise behavior satisfying a distortion
constraint $E\,d(x,y) \le D$ with high probability. This relation
is further discussed in the proof of Lemma 1. Their result is
similar to ours in that they define the rate in terms of the
input and output alone. The result is similar to obtaining the
rate function $R_{emp} \approx R_X(E\,d(x,y))$ in the sense of Theorems
1, 2. However, their result is not tight even for Gaussian
channels: for the Gaussian channel $Y = X + V$ with noise
V limited to power N and the Gaussian prior $X \sim \mathcal{N}(0,P)$,
this rate function equals $R_X(N) = \frac{1}{2}\log\left(\frac{P}{N}\right)$, which is smaller
than this channel's capacity, whereas with Theorem 2 we would
obtain $\frac{1}{2}\log\left(1 + \frac{P}{N}\right)$. Agarwal's result is tight in the sense that
this is the maximum rate that can be guaranteed given this
distortion. There exists a channel with the same distortion N
whose capacity is only $\frac{1}{2}\log\left(\frac{P}{N}\right)$: the channel $Y = \alpha X + \beta V$
with $\alpha = \beta^2 = 1 - \frac{N}{P}$. The reason for the sub-optimality of the
result is that the squared distance between the input and output,
in contrast with the correlation factor, does not yield a tight
representation of all memoryless linear Gaussian channels (in
the sense of Lemma 3).
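The comparison above can be verified with a few lines of arithmetic. The sketch below assumes $X \sim \mathcal{N}(0,P)$ and $V \sim \mathcal{N}(0,N)$ independent, squared-error distortion, and the gain choice $\alpha = \beta^2 = 1 - N/P$; the function names and the values of P and N are illustrative assumptions.

```python
import math

def rate_distortion_gaussian(P, D):
    """Rate-distortion function of a N(0, P) source under squared error, in nats."""
    return max(0.0, 0.5 * math.log(P / D))

def linear_channel_stats(P, N, alpha, beta):
    """Mean squared distortion E(Y - X)^2 and capacity of Y = alpha*X + beta*V
    with X ~ N(0, P) and V ~ N(0, N) independent."""
    distortion = (1.0 - alpha) ** 2 * P + beta ** 2 * N
    capacity = 0.5 * math.log(1.0 + alpha ** 2 * P / (beta ** 2 * N))
    return distortion, capacity

P, N = 4.0, 1.0                       # assumed power and distortion level
alpha = 1.0 - N / P
beta = math.sqrt(alpha)               # so that alpha = beta^2
distortion, capacity = linear_channel_stats(P, N, alpha, beta)
rd_rate = rate_distortion_gaussian(P, N)
additive_capacity = 0.5 * math.log(1.0 + P / N)   # channel Y = X + V
print(distortion, capacity, rd_rate, additive_capacity)
```

The scaled channel has the same distortion N as $Y = X + V$ but a strictly smaller capacity, which is why a guarantee based on distortion alone cannot exceed $\frac{1}{2}\log(P/N)$.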



V. Communication with feedback 
A. Overview and background 

In this section we present the rate-adaptive counterparts of
Theorems 1, 2 and the scheme achieving them. The proof is
deferred to the next section. To attain these rates adaptively,
we iterate a rateless coding
scheme. In other words, in each iteration we send a fixed
number of bits K, by transmitting symbols from an n-length
codebook, until the receiver has enough information to decode.
Then, the receiver sends an indication that the block is over
and a new block starts.

Before developing the details we give some background
regarding the evolution of rateless codes, and the differences
between the proposed techniques. The earliest work is that of
Burnashev [12], who showed that for known channels, using
feedback and a random decision time (i.e. a decision time
which depends on the channel output) yields an improved
error exponent, which is attained by a 3 step protocol (best
described in [11]) and shown to be optimal. Shulman [19]
proposed to use a random decision time as a means to deal with
sending common information over broadcast channels (static
broadcasting), and for unknown compound channels (which
are treated as broadcast). In this scheme, later described as
"rateless coding" (or Incremental Redundancy Hybrid ARQ),
a codebook of exp(K) infinite sequences is generated, and
the sequence representing the message is sent to the receiver
symbol by symbol, until the receiver decides to decode (and
turn off, in case of a broadcast channel). Tchamkerten and
Telatar [11] connect the two results by showing that for some,
but not all, compound channels the Burnashev error exponent
can be attained universally using rateless coding and the 3
step protocol. Eswaran, Sarwate, Sahai and Gastpar [3] used
iterated rateless coding to achieve the mutual information
related to the empirical noise statistics on channels with
individual noise sequences. The scheme we use here is most
similar to the one used in [3] but less complicated. We do not
use training symbols to learn the channel in order to decide on
the decoding time but rely on the mutual information itself as
the criterion (based on Lemmas 1, 4), and the partitioning into
blocks and the decision rules are simpler. The result in [3] is
an extension of a result in [1] regarding the binary channel to
general discrete channels with individual noise sequence. The
original result in [1] was obtained not by rateless codes but by
a successive estimation scheme [20] which is a generalization
of the Horstein [21] and Schalkwijk-Kailath [22] schemes.
The same authors extend their results to discrete channels
[2] using successive schemes (where the target rate is the
capacity of the respective modulo-additive channel). The two
concepts in achieving the empirical rates differ in various
factors such as complexity and the amount of feedback and
randomization required. The successive schemes require less
common randomness but assume perfect feedback, while the
schemes based on rateless coding require less (asymptotically
zero rate) feedback but potentially more randomness.

As noted, the technique we use here is similar to that of [3]
in its high level structure, while the structure of the rateless
decoder is similar to that of [19] (chapter 3). The application of this
scheme to individual inputs and outputs and the extension to
real-valued models require proof, and especially issues such
as abnormal behavior of specific (e.g. last) symbols have to be
treated carefully. The result of [3] cannot be applied directly
to individual channels since the channel model cannot be
extracted based on the input and output sequences alone, and
in the latter both the model and the sequence are assumed to
be fixed (over common randomness).

B. Statement of the main result 

In this section we prove the following theorems, relating to
the definitions given in section III-B.

Theorem 3 (Rate adaptive, discrete channels). Given discrete
input and output alphabets $\mathcal{X}, \mathcal{Y}$, for every $P_e > 0$, $P_A > 0$,
$\delta > 0$ and prior $Q(x)$ over $\mathcal{X}$ there is n large enough and a
random encoder and decoder with feedback and variable rate
over block size n with a subset $J \subset \mathcal{X}^n$, such that:

• The distribution of the input sequence is $x \sim Q^n$
independently of the feedback and message

• The probability of error is smaller than $P_e$ for any x, y

• For any input sequence $x \notin J$ and output sequence
$y \in \mathcal{Y}^n$ the rate is $R \ge \hat{I}(x; y) - \delta$

• The probability of J is bounded by $\Pr(x \in J) \le P_A$

Theorem 4 (Rate adaptive, continuous channels). Given the
channel $\mathbb{R} \to \mathbb{R}$, for every $P_e > 0$, $P_A > 0$, $\delta > 0$, $\bar{R} > 0$, and
power $P > 0$ there is n large enough and a random encoder
and decoder with feedback and variable rate over block size
n with a subset $J \subset \mathbb{R}^n$, such that:

• The distribution of the input sequence is $x \sim \mathcal{N}(0,P)^n$
independently of the feedback and message

• The probability of error is smaller than $P_e$ for any x, y

• For any input sequence $x \notin J$ and output sequence $y \in \mathbb{R}^n$
the rate is $R \ge \min\left[\frac{1}{2}\log\left(\frac{1}{1-\rho^2(x,y)}\right) - \delta,\ \bar{R}\right]$

• The probability of J is bounded by $\Pr(x \in J) \le P_A$

Note that in the last theorem we do not have uniform
convergence of the rate function in x, y. Unfortunately our
scheme is limited by having a maximum rate for each n,
and although the maximum rate tends to infinity as $n \to \infty$,
we cannot guarantee uniform convergence for each n in the
continuous case, where the target rate may be unbounded. The
rates in the theorems are the minimal rates, and under certain
conditions (e.g. a channel varying in time) higher rates may
be achieved by the scheme proposed below.

Regarding the set J: as we shall see in the sequel, there are
some sequences for which a poor rate is obtained, and since we
committed to an input distribution we cannot avoid them (one
example is the sequence of $\frac{1}{2}n$ zeros followed by $\frac{1}{2}n$ ones,
in which at most one block will be sent). However, there is
an important distinction between claiming, for example, that
"for each y the probability of $R < R_{emp}$ is at most $P_A$"
and the claim made in the theorems that "$R < R_{emp}$ only
when x belongs to a subset J with probability at most $P_A$".
The first claim is weaker since a smartly chosen y may increase
the probability (see figure 4). This is avoided in the second
claim. A consequence of this definition is that the probability
of $R < R_{emp}$ is bounded by $P_A$ for any conditional probability
$\Pr(y|x)$ on the sequences. This issue is further discussed in
section VI-A.

Note that the probability $P_A$ could be absorbed into $P_e$
by a simple trick, but this seems to make the theorem
less insightful. After reception the receiver knows the input
sequence with probability of at least $1 - P_e$ and may calculate
the empirical mutual information $\hat{I}(x; y)$. If the rate achieved
by the scheme we will describe later falls short of $\hat{I}(x; y)$, it
may declare a rate of $R = \hat{I}(x; y)$ (which will most likely
result in a decoding error). This way the receiver will never
declare a rate which is lower than $\hat{I}(x; y)$ unless there is an
error, and we could avoid the restriction $x \notin J$ required for
achieving $R_{emp}$, but on the other hand, the error probability
becomes conditioned on the set J. The question whether the
set J itself is truly necessary (i.e. is it possible to attain the
above theorems with $J = \emptyset$) is still open.

Figure 3 illustrates the lower bound for $R_{emp}$ presented by
Theorem 4 ($R_{LB2}$) as well as a (higher) lower bound $R_{LB1}$ for
the rate achieved by the proposed scheme (see section VI-C2,
Eq. (65)). The parameters generating these curves appear in
table III in the appendix.

Fig. 3. [Plot legend: target function $R = -\frac{1}{2}\log(1-\rho^2) = \frac{1}{2}\log(1+\mathrm{SNR}_{eff})$;
lower bound $R_{LB1}$ achieved by the proposed scheme for (n = 1e+008, K = 1e+006);
$R_{emp}$ lower bound $R_{LB2} = \min(R - \delta, R_{max})$; bottom axis: Effective SNR [dB].]
Lower bound $R_{LB1}$ shown in the proof in section VI-C2 as a function of $\rho$
(top) and the effective SNR $= \frac{\rho^2}{1-\rho^2}$ (bottom). Parameters appear in table III
in the appendix.

We prove the two theorems together. First we define the
scheme, and in the next section we analyze its error
performance and rate and show it achieves the promise of the
theorems. Throughout this section and the following one we
use n to denote the length of a complete transmission, and m
to denote the length of a single block.



C. A proposed rate adaptive scheme 

The following communication scheme sends B indices from
$\{1,\ldots,M\}$ over n channel uses (or equivalently sends a
number in [0, 1) in resolution $M^{-B}$), where M is fixed, and
B varies according to empirical channel behavior. The building
block is a rateless transmission of one of M codewords ($K =
\log(M)$ information units), which is iterated until the n-th
symbol is reached.

The transmit distribution Q is an arbitrary distribution for
the discrete case and $Q = \mathcal{N}(0, P)$ for the continuous case.
We define the decoding metric as the empirical rate:

$$R_{emp}(x,y) \equiv \begin{cases} \hat{I}(x; y) & \text{discrete} \\ \frac{1}{2}\log\left(\frac{1}{1-\rho^2(x,y)}\right) & \text{continuous} \end{cases} \qquad (20)$$

The codebook $C_{M \times n}$ consists of M codewords of length n,
where all $M \times n$ symbols are drawn i.i.d. $\sim Q$ and known
to the sender and receiver. For brevity of notation we write
$R^m_{emp}(x, y)$ instead of $R_{emp}(x_1^m, y_1^m)$. k denotes the absolute
time index $1 \le k \le n$. Block b starts from index $k_b$, where
$k_1 = 1$. $m = k - k_b + 1$ denotes the time index inside the
current block.

In each rateless block $b = 1, 2, \ldots$, a new index
$i = i_b \in \{1,\ldots,M\}$ is sent to the receiver using the following
procedure:

1) The encoder sends index i by sending the symbols of
codeword i:

$$x_k = C_{i,k} \qquad (21)$$

Note that different blocks use different symbols from
the codebook.

2) The encoder keeps sending symbols and incrementing k
until the decoder announces the end of the block through
the feedback link.

3) The decoder announces the end of the block after symbol
m in the block if for some codeword $x_i$:

$$R^m_{emp}(x_i, y) \equiv R_{emp}\left((x_i)_{k_b}^{k},\ y_{k_b}^{k}\right) \ge \mu^*_m \qquad (22)$$

where $\mu^*_m$ is a fixed threshold per symbol defined in
Eq. (23) below.

4) When the end of block is announced, one of the i
fulfilling Eq. (22) is determined as the index of the
decoded codeword $\hat{i}_b$ (breaking ties arbitrarily).

5) Otherwise the transmission continues, until the n-th
symbol is reached. If symbol n is reached without
fulfilling Eq. (22), then the last block is terminated without
decoding.

After a block ends, b is incremented and if $k < n$ a new
block starts at symbol $k_b = k + 1$. After symbol n is reached
the transmission stops and the number of blocks sent is
$B = b - 1$.
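The procedure above can be sketched in code for the continuous case. This is a minimal illustration, not the authors' implementation: it assumes the continuous-case threshold $\mu^*_m = (K + \log(2n/P_e))/(m-1)$, an ideal feedback link, and a hypothetical AWGN test channel; the names `r_emp`, `threshold` and `run_scheme` and all parameter values are assumptions made for this example.

```python
import math
import random

def r_emp(x, y):
    """Continuous-case metric of Eq. (20): 1/2 * log(1 / (1 - rho^2)), in nats."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0                              # rho = 0 in the singular case
    rho2 = min(1.0, sxy * sxy / (sxx * syy))
    return math.inf if rho2 == 1.0 else 0.5 * math.log(1.0 / (1.0 - rho2))

def threshold(m, K, n, p_e):
    """Assumed continuous-case threshold: (K + log(2n / P_e)) / (m - 1)."""
    return (K + math.log(2.0 * n / p_e)) / (m - 1)

def run_scheme(codebook, indices, channel, p_e):
    """Iterated rateless transmission; returns the list of decoded indices."""
    M, n = len(codebook), len(codebook[0])
    K = math.log(M)
    decoded, y = [], []
    k_b = 0                                     # 0-based start of current block
    for k in range(n):
        i = indices[len(decoded)]               # index carried by this block
        y.append(channel(codebook[i][k]))       # Eq. (21): x_k = C_{i,k}
        m = k - k_b + 1                         # time index inside the block
        if m < 2:
            continue                            # threshold undefined at m = 1
        scores = [r_emp(codebook[j][k_b:k + 1], y[k_b:]) for j in range(M)]
        best = max(range(M), key=lambda j: scores[j])
        if scores[best] >= threshold(m, K, n, p_e):   # Eq. (22) is met
            decoded.append(best)                # end of block is announced
            k_b = k + 1                         # next block starts at k + 1
    return decoded                              # unfinished last block is dropped

rng = random.Random(7)
M, n = 4, 400
codebook = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(M)]
indices = [rng.randrange(M) for _ in range(n)]  # more indices than can be sent
decoded = run_scheme(codebook, indices, lambda x: x + rng.gauss(0.0, 0.3), 1e-3)
B = len(decoded)
print(B, B * math.log(M) / n)                   # blocks sent and rate in nats/use
```

With such a short transmission the offset $\log(2n/P_e)$ dominates K, so the achieved rate is far below the empirical rate; the theorems concern the regime where n and K are large.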

The threshold $\mu^*_m$ is defined as:

$$\mu^*_m = \frac{K}{m-s} + \frac{1}{m-s}\log\left(\frac{n}{P_e}\right) + \delta_m = \begin{cases} \frac{K + \log\left(\frac{n}{P_e}\right)}{m} + \frac{|\mathcal{X}||\mathcal{Y}|\log(m+1)}{m} & \text{discrete} \\ \frac{K + \log\left(\frac{2n}{P_e}\right)}{m-1} & \text{continuous} \end{cases} \qquad (23)$$

where $s = 0$ for the discrete case and 1 for the continuous case,
and $\delta_m$ is defined in Lemma 1 for the discrete case and equals
$\frac{\log 2}{m-1}$ for the continuous case. The threshold $\mu^*_m$ is tailored
to achieve the designated error probability and is composed
of 3 parts. The first part requires that the empirical rate $R_{emp}$
would approximately equal the transmission rate of the block,
$\frac{K}{m}$, which guarantees there is approximately enough mutual
information to send K information units. The second part is an
offset responsible for guaranteeing an error probability bounded
by $P_e$ over all the blocks in the transmission. The third part,
$\delta_m$, compensates for the overhead terms in Lemmas 1, 4.

The scheme achieves the claims of Theorems 3, 4 with a
proper choice of the parameters (discussed in section VI-C).
Note that the scheme uses a feedback rate of 1 bit/use; however,
it is easy to show that any positive feedback rate is sufficient (see
section VI-C), therefore we can claim the theorems hold with
"zero rate" feedback.

We devote the next section to the analysis of the error
probability and rate of the scheme, showing it attains Theorems
3, 4. Unfortunately, although the scheme is simple, the current
analysis we have is somewhat cumbersome.

VI. Proof of the main result 

In this section we analyze the adaptive rate scheme
presented and show it achieves Theorems 3, 4. Before analyzing
the scheme we develop some general results pertaining to the
convexity of the mutual information and correlation factors
over sub-vectors. The proof of the error probability is simple
(based on the construction of $\mu^*_m$) and common to the two
cases. The proof of the achieved rate is more complex and
performed separately for each case.

A. Preliminaries 

1) Likely convexity of the mutual information: A property
which would be useful for the analysis is ∪-convexity of the
empirical mutual information with respect to joint empirical
distributions $\hat{P}_{(x,y)}(x,y)$ measured over different sub-vectors,
so for example we would like to have for $0 < m < n$:

$$\hat{I}(x_1^n; y_1^n) \le \frac{m}{n}\,\hat{I}(x_1^m; y_1^m) + \left(1 - \frac{m}{n}\right)\hat{I}(x_{m+1}^n; y_{m+1}^n) \qquad (24)$$

which would guarantee that if we achieve a rate equal to the
empirical mutual information over the two sections $0 < k \le m$
and $m < k \le n$, then we would achieve the empirical mutual
information over the entire vector $0 < k \le n$. However, this
property does not hold in general since the mutual information
is not convex with respect to the joint distribution. The mutual
information $I(P, W)$ is known to be convex ∪ with respect to
W and concave ∩ with respect to P, so if, for example, the
conditional distributions over the sections [1, m] and [m+1, n]
are equal and only the distribution of x differs, the condition
would in general not hold. On the other hand, should the
empirical distributions of $x_1^m$ and $x_{m+1}^n$ be equal, then the
empirical mutual information expressions appearing in Eq. (24)
would differ only in the conditional distributions of y w.r.t. x
and the assertion would hold. Since we generate x by i.i.d.
drawing of its elements, the empirical distributions converge
to the prior Q, and we would expect that if the size of both
regions, m and $n - m$, is large enough, the convexity would
hold up to a fraction $\epsilon$ with high probability. We show below that
such convexity holds under even milder conditions. The cases
in which this approximate convexity is used later on can serve
as examples of the difference between the individual model
used here and probabilistic models (including the individual
noise sequence). We use the lemma to:

1) Bound the loss due to insufficient utilization of the last 
symbol in each rateless encoding block. 

2) Bound the loss due to not completing the last rateless 
encoding block. 

3) Show that the average rate (empirical mutual informa- 
tion) over multiple blocks equals at least the mutual 
information measured over the blocks together 

Had the rate been averaged over multiple sequences x rather 
than obtained for a specific sequence, the regular convexity of 
the mutual information with respect to the channel distribution 
would have been sufficient. The property is formalized in the 
following lemma: 

Lemma 5 (Likely convexity of mutual information). Let
$\{A_i\}_{i=1}^{p}$ define a disjoint partitioning of the index set
$\{1,\ldots,n\}$, i.e. $\bigcup_i A_i = \{1,\ldots,n\}$ and $A_i \cap A_j = \emptyset$ for $i \ne j$.
x, y are n-length sequences, and $x_A$, $y_A$ define the
sub-sequences of x, y (resp.) over the index set A. Let the elements
of x be chosen i.i.d. with distribution Q. Then for any $\Delta > 0$
there is a subset $J_\Delta \subset \mathcal{X}^n$ such that:

$$\forall x \notin J_\Delta,\ y \in \mathcal{Y}^n: \quad \sum_{i=1}^{p} \frac{|A_i|}{n}\,\hat{I}(x_{A_i}; y_{A_i}) \ge \hat{I}(x; y) - \Delta \qquad (25)$$

and

$$Q^n\{J_\Delta\} \le \exp\left(-n\left[\Delta - \delta_n\right]\right) \qquad (26)$$

with $\delta_n = p\,|\mathcal{X}|\,\frac{\log(n+1)}{n} \to 0$.



The lemma does not claim that convexity holds with high
probability, but rather that any positive deviation from
convexity may happen only on a subset of x with vanishing
probability. It is surprising that the bound does not depend
on y, Q and the size of the subsets, and only weakly depends
on the number of subsets.

Fig. 4. Illustration of bad sequences and Lemma 5



Before proving the lemma we emphasize a delicate point:
the lemma does not only claim that for each y the probability
of deviation from convexity is small, but makes the stronger
claim that, apart from a subset of the x sequences with
vanishing probability, convexity always holds independently
of y. This distinction is important since this lemma defines
a set of "bad" input sequences that fail our scheme. In these
sequences there exists a partitioning that yields an excessive
deviation from the distribution Q between rateless blocks. As
an example of such a sequence consider the binary channel
and the input sequence $0^{n/2}1^{n/2}$ (n/2 zeros followed by n/2
ones). This sequence is bad since it guarantees that, on one
hand, at most one block will be received (since at most one
block includes both 0-s and 1-s at the input), but on the
other hand the zero order empirical input distribution is good
(Ber($\frac{1}{2}$)), so potentially we have the combination of high
empirical mutual information with low communication rate.
The sequences that deviate from convexity are a function
of the output y. Had we only bounded the probability of
deviation from convexity for each y individually, then
a potential adversary could have increased this probability by
determining y (given x) such that x will be a bad sequence
with respect to this y. To avoid this, we claim that there is a
fixed group of x such that if the sequence is not in the group,
approximate convexity holds regardless of y. This is illustrated
in fig. 4, where the dark spots mark the pairs (x, y) for which
convexity does not hold.

Proof of Lemma 5: Define the vector u denoting the
subset number of each element, $u_k = i\ \ \forall k \in A_i$. Then
$\hat{I}(x_{A_i}; y_{A_i}) = \hat{I}(x; y|u = i)$, and $\hat{P}_u(i) = \frac{|A_i|}{n}$, therefore
we can write the weighted sum of empirical mutual
information over the partitions as a conditional empirical mutual
information:

$$\sum_{i=1}^{p} \frac{|A_i|}{n}\,\hat{I}(x_{A_i}; y_{A_i}) = \sum_{i=1}^{p} \hat{P}_u(i)\,\hat{I}(x; y|u = i) = \hat{I}(x; y|u) \qquad (27)$$

Using the chain rule for mutual information (see [14] section
2.5):

$$\hat{I}(x; y) - \hat{I}(x; y|u) = \hat{I}(x; y) - \left(\hat{I}(x; yu) - \hat{I}(x; u)\right) = \hat{I}(x; u) - \hat{I}(x; u|y) \le \hat{I}(x; u) \qquad (28)$$

Define the set $J_\Delta = \{x : \hat{I}(x; u) > \Delta\}$, then

$$\forall x \notin J_\Delta,\ y: \quad \hat{I}(x; y) - \hat{I}(x; y|u) \le \hat{I}(x; u) \le \Delta \qquad (29)$$

And since x is chosen i.i.d. and u is a fixed vector, we have
from Lemma 1:

$$\Pr(x \in J_\Delta) \le \exp\left(-n\left(\Delta - \delta_n\right)\right) \qquad (30)$$

with $\delta_n = |\mathcal{X}|\,|\{1,\ldots,p\}|\,\frac{\log(n+1)}{n}$. □

Note that if the distribution of x is the same over all
partitions then $\hat{P}(x|u) = \hat{P}(x)$, therefore $\hat{I}(x; u) = 0$ and
the empirical mutual information will be truly convex.
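The identity (27) and the inequality (28) hold for every pair of sequences and every partition, so they can be checked directly. The following is a minimal sketch using plug-in (empirical) mutual information; the helper names, the alphabets and the random test sequences are assumptions for illustration. It also evaluates the "bad" sequence $0^{n/2}1^{n/2}$ with the two-halves partition.

```python
import math
import random
from collections import Counter

def emp_mi(a, b):
    """Empirical mutual information of two equal-length sequences, in nats."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(c / n * math.log(c * n / (pa[s] * pb[t]))
               for (s, t), c in pab.items())

def weighted_block_mi(x, y, u):
    """Sum over subsets i of |A_i|/n * I(x_Ai; y_Ai), i.e. I(x; y | u)."""
    n = len(x)
    total = 0.0
    for label in set(u):
        idx = [k for k in range(n) if u[k] == label]
        total += len(idx) / n * emp_mi([x[k] for k in idx],
                                       [y[k] for k in idx])
    return total

# Bad sequence: n/2 zeros then n/2 ones, noiseless output, two-halves partition.
bad_x = [0] * 4 + [1] * 4
bad_u = [0] * 4 + [1] * 4
bad_total = emp_mi(bad_x, bad_x)                  # log 2
bad_blocks = weighted_block_mi(bad_x, bad_x, bad_u)   # 0: each block is constant
bad_gap = emp_mi(bad_x, bad_u)                    # I(x; u) = log 2

# Random sequences: the deviation from convexity never exceeds I(x; u).
rng = random.Random(3)
n = 200
x = [rng.randrange(2) for _ in range(n)]
y = [rng.randrange(3) for _ in range(n)]
u = [k * 4 // n for k in range(n)]                # four contiguous subsets
deviation = emp_mi(x, y) - weighted_block_mi(x, y, u)
print(bad_total, bad_blocks, bad_gap, deviation, emp_mi(x, u))
```

For the bad sequence the whole empirical mutual information is lost across the partition, and the loss equals $\hat{I}(x; u)$ exactly, which is the quantity that defines the set $J_\Delta$.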

2) Likely convexity of the correlation factor: For the
continuous case we use the following property, which somewhat
parallels Lemma 5. The reasons for not following the same
path as the discrete case will be explained in the sequel
(subsection VI-C). Unfortunately the proof is very technical
and less elegant and is therefore deferred to the appendix
(appendix E). Note that again the bound does not depend on
the size of the subsets.

Lemma 6 (Likely convexity of $\rho^2$). Define $\{A_i\}_{i=1}^{p}$ as in
Lemma 5. Let x, y be n-length sequences and define the
correlation factors of the sub-sequences, and the overall
correlation factor, as

$$\rho_i = \frac{|x_{A_i}^T y_{A_i}|}{\|x_{A_i}\|\,\|y_{A_i}\|} \quad \text{and} \quad \rho = \frac{|x^T y|}{\|x\|\,\|y\|} \qquad (31)$$

respectively. Let x be drawn i.i.d. from a Gaussian distribution
x 7V"(0,P). Then for any < A < = there is a subset 
Ja C M n such that: 

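$$\forall x \notin J_\Delta,\ y \in \mathbb{R}^n: \quad \sum_{i=1}^{p} \frac{|A_i|}{n}\,\rho_i^2 \ge \rho^2 - \Delta \qquad (32)$$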



And 



Pr{x G J a} < 2 p e~ nA2 / 8 



(33) 



I.e. there is a subset with high probability on which the mean 
of the correlation factors does not fall considerably below the 
overall correlation factor. 

3) Likely convexity with dependencies: The properties of
likely convexity defined in the previous sections pertain to
a case where the partition of the n block is fixed and x is
drawn i.i.d. However, in the transmission scheme we described,
the partition varies in a way that depends on the value of
x (through the decoding decisions and the empirical mutual
information), which may, in general, change the probability
that the convexity property holds with a given $\Delta$. Although
it stands to reason that the variability of the block sizes in
the decoding process reduces the probability of deviating from
convexity, since it tends to equalize the amount of mutual
information in each rateless block, for the analysis we assume
an arbitrary dependence, and assume that the size of the set
J increases by a factor equal to the number of possible partitions, as
explained below.

Denote a partition by $\pi = \{A_i\}_{i=1}^{p}$ (as defined in Lemmas
5, 6) and the group of all possible partitions (for a given
encoder-decoder) by $\Pi$. For each partition $\pi$, from Lemmas 5, 6
there is a subset $J(\pi)$ with probability bounded by $p_J$ outside
which approximate convexity (as defined in the lemmas)
holds. Then approximate convexity is guaranteed to hold for
$x \notin J \equiv \bigcup_{\pi \in \Pi} J(\pi)$, where the probability of the set J is
bounded by the union bound:

$$\Pr(x \in J) = \Pr\left\{\bigcup_{\pi \in \Pi}\left(x \in J(\pi)\right)\right\} \le |\Pi| \cdot p_J \qquad (34)$$

Now we bound the number of partitions. In the two cases
we will deal with in section VI-C the number of subsets can be
bounded by $p_{max}$, and all subsets but one contain contiguous
indices. Therefore the partition is completely defined by the
start and end indices of $p_{max} - 1$ subsets (allowed to overlap
if there are fewer than $p_{max}$ subsets), thus $|\Pi| \le n^{2(p_{max}-1)} \le n^{2p_{max}}$
and we have

$$\Pr(J) \le n^{2p_{max}} \cdot p_J = \exp\left(2p_{max}\log(n)\right) \cdot p_J \qquad (35)$$

where $p_J$ is defined in the previous lemmas. So for our
purposes we may say that these lemmas hold even if the
partition depends on x, with an appropriate change in the
probability of J.
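To get a feel for when the bound (35) is meaningful, it can be evaluated in the log domain. The sketch below assumes the discrete case, with $p_J$ taken from Lemma 5 as $\exp(-n(\Delta - \delta_n))$ and $p = p_{max}$ subsets; the function name and the numerical values are illustrative assumptions, not parameters from the paper.

```python
import math

def log_prob_bad_set(n, p_max, alphabet_size, delta):
    """Natural log of n^(2 p_max) * exp(-n (delta - delta_n)), where
    delta_n = p_max * |X| * log(n + 1) / n as in Lemma 5 with p = p_max."""
    delta_n = p_max * alphabet_size * math.log(n + 1) / n
    return 2 * p_max * math.log(n) - n * (delta - delta_n)

# Assumed parameters: binary input, at most 10 subsets, Delta = 0.01.
log_bound_large_n = log_prob_bad_set(10 ** 6, 10, 2, 0.01)
log_bound_small_n = log_prob_bad_set(10 ** 3, 10, 2, 0.01)
print(log_bound_large_n, log_bound_small_n)
```

A negative value means the bound on Pr(J) is below 1 and therefore informative; a positive value means the bound is vacuous for that block length. The polynomial factor $n^{2p_{max}}$ is eventually overwhelmed by the exponential decay in n.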

B. Error probability analysis 

In this subsection we show the probability to decode incor- 
rectly any of the B indices is smaller than P e . 

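With $R_{emp}$ defined in Eq. (20), we have from Lemma 4 that
under the conditions of the lemma $\Pr(R_{emp} \ge t) = \Pr(|\rho| \ge
R_2^{-1}(t)) \le 2\exp(-(n-1)t)$. Then, combining Lemmas 1 and
4, we may say that for any $y^m$ the probability of $x^m$ generated
i.i.d. from the relevant prior to have $R_{emp} \ge t$ is bounded by:

$$Q^m\left(R_{emp}(x_1^m, y_1^m) \ge t\right) \le \exp\left(-(m-s)(t-\delta_m)\right) \qquad (36)$$

where

$$\delta_m = \begin{cases} \frac{|\mathcal{X}||\mathcal{Y}|\log(m+1)}{m} & \text{discrete} \\ \frac{\log 2}{m-1} & \text{continuous} \end{cases} \qquad (37)$$

and

$$s = \begin{cases} 0 & \text{discrete} \\ 1 & \text{continuous} \end{cases} \qquad (38)$$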



An error might occur if at any symbol $1 \le k \le n$ an
incorrect codeword meets the termination condition Eq. (22).
The probability that codeword $j \ne i$ meets Eq. (22) at a specific
symbol k which is the m-th symbol of a rateless block is
bounded by:

$$\Pr\left(R^m_{emp}(x_j, y) \ge \mu^*_m\right) \le \exp\left(-(m-s)(\mu^*_m - \delta_m)\right) = \exp\left(-\left(K + \log\frac{n}{P_e}\right)\right) = \frac{P_e}{n\exp(K)} = \frac{P_e}{Mn} \qquad (39)$$

The probability of any erroneous codeword to meet the
threshold at any symbol is bounded by the union bound:

$$\Pr(\text{error}) \le \Pr\left\{\bigcup_{k=1}^{n}\bigcup_{j \ne i}\left(R^m_{emp}(x_j, y) \ge \mu^*_m\right)\right\} \le n(M-1)\cdot\frac{P_e}{Mn} < P_e \qquad (40)$$

The first inequality holds because the correct codeword might be decoded even if an erroneous codeword met the threshold. Although the index $m$ in the expression above depends on $k$ and the specific sequences x, y in an unspecified way, the assertion is true since the probability of the event in the union has an upper bound independent of $m$.
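The arithmetic behind Eqs. (39) and (40) can be checked numerically. The following Python sketch assumes the threshold has the form $\mu^*_m = (K + \log(n/P_e))/(m-s) + \delta_m$, which is what Eqs. (36)-(39) imply, and that the codebook has $M = \exp(K)$ words. The function names and the binary alphabet sizes are illustrative choices, not part of the scheme's definition.

```python
import math

def delta_m(m, discrete=True, ax=2, ay=2):
    """Overhead term of Eq. (37); ax, ay are the alphabet sizes (discrete case)."""
    if discrete:
        return ax * ay * math.log(m + 1) / m
    return math.log(2) / (m - 1)

def threshold(m, K, n, Pe, discrete=True, ax=2, ay=2):
    """Assumed threshold: (m - s) * (mu - delta_m) = K + log(n / Pe)."""
    s = 0 if discrete else 1
    if m - s <= 0:
        return math.inf  # the block cannot terminate on this symbol
    return (K + math.log(n / Pe)) / (m - s) + delta_m(m, discrete, ax, ay)

def pairwise_bound(m, K, n, Pe, discrete=True, ax=2, ay=2):
    """Right-hand side of Eq. (36) evaluated at t = mu*_m."""
    s = 0 if discrete else 1
    mu = threshold(m, K, n, Pe, discrete, ax, ay)
    if math.isinf(mu):
        return 0.0
    return math.exp(-(m - s) * (mu - delta_m(m, discrete, ax, ay)))

K, n, Pe = 20.0, 1000, 1e-3           # illustrative values, in nats
M = math.exp(K)                       # assumed number of codewords
p_pair = pairwise_bound(10, K, n, Pe)
p_union = n * (M - 1) * p_pair        # union over n symbols and M - 1 wrong codewords
```

With this choice the pairwise bound does not depend on $m$, which is the property the union bound relies on.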

C. Rate analysis 

Roughly speaking, since $\mu^*_m \approx \frac{K}{m}$, if no error occurs the correct codeword crosses the threshold when $R^m_{emp}(x_i, y) \approx \frac{K}{m}$. Therefore the rate achieved over a rateless block is $R_b = \frac{K}{m} \approx R^m_{emp}(x_i, y)$, and due to the approximate convexity, by achieving the above rate on each block separately we meet or exceed the rate $R_{emp}(x, y)$ over the entire transmission. However, in a detailed analysis we have the following sources of rate loss:

1) The offsets inserted in $\mu^*_m$ to meet the desired error
probability

2) The offset from convexity (Lemma 5) introduced by the
slight differences in empirical distribution of x between
the blocks

3) Unused symbols: 

a) The last symbol of each block is not fully utilized, 
as explained below 

b) The last (unfinished) block is not utilized

Regarding the last symbol of each block, note that after receiving the previous symbol the empirical mutual information is below the threshold, and at the last symbol it meets or exceeds the threshold. However, the proposed scheme does not gain additional rate from the difference between the mutual information and the threshold, and thus it loses with respect to its target (the mutual information over the block) when this difference is large. Here a "good" channel works to our disadvantage. Since we operate under an individual channel regime, the increase of the mutual information at the last symbol is not bounded by the average information content of a single symbol. This is especially evident in the continuous case, where the empirical mutual information is unbounded. A high value of $y$ together with a high value of $x$ at the last symbol causes an unbounded increase in $R_{emp}$: if we choose $x_m, y_m \to \infty$ then $\hat{\rho} \to 1$ regardless of the history $x_1^{m-1}, y_1^{m-1}$. Therefore over a single block we might have an arbitrarily low rate ($|\hat{\rho}|$ is small over the first $m-1$ symbols) and an arbitrarily large $R_{emp}$. In the discrete case this phenomenon exists but is less pronounced (consider for example the sequences $x = y = 0^{m-1}1 = (0, \ldots, 0, 1)$). Similarly, regarding the last block, the fact that the length of the block may be bounded does not mean the increase in the empirical mutual information can be bounded as well. We use the approximate convexity (Lemma 5) to show the last two losses are bounded for most x sequences.

TABLE I
Summary of definitions and references for the discrete and continuous cases

| Item | Discrete case | Continuous case |
|---|---|---|
| Input distribution | Any $Q$ | $Q = \mathcal{N}(0, P)$ |
| Decoding metric | $R_{emp}(x,y) = \hat{I}(x;y)$ | $R_{emp}(x,y) = \frac{1}{2}\log\big(\frac{1}{1-\hat{\rho}^2(x,y)}\big)$ |
| Decoder | maximize $R_{emp}(x,y)$ $\Leftrightarrow$ maximize $\hat{I}(x;y)$ | maximize $R_{emp}(x,y)$ $\Leftrightarrow$ maximize $\lvert\hat{\rho}(x,y)\rvert$ |
| Pairwise error probability $\Pr(R_{emp} \ge t)$ | $\le \exp(-n(t-\delta_n))$ (Lemma 1) | $\le 2\exp(-(n-1)t)$ (Lemma 4) |
| Likely convexity condition ($\forall x \notin J_\Delta$, $y \in \mathcal{Y}^n$, with $\lambda_i = \frac{1}{n}\lvert A_i\rvert$) | $\sum_{i=1}^{p} \lambda_i \hat{I}(x_i; y_i) \ge \hat{I}(x;y) - \Delta$ (Lemma 5) | $\sum_{i=1}^{p} \lambda_i \hat{\rho}_i^2 \ge \hat{\rho}^2 - \Delta$ (Lemma 6) |
| Likely convexity probability ($\Pr(x \notin J_\Delta)$, fixed partitioning) | $\ge 1 - \exp(-n(\Delta-\delta_n))$ | $\ge 1 - 2^p e^{-n\Delta^2/8}$ |
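The jump at the last symbol of a block can be reproduced with a few lines of Python. This is a minimal sketch: `empirical_mutual_information` computes the zero-order empirical mutual information in bits, and `empirical_correlation` assumes the zero-mean correlation factor $\sum x_k y_k / \sqrt{\sum x_k^2 \sum y_k^2}$; both helper names and the sample sequences are illustrative.

```python
import math
from collections import Counter

def empirical_mutual_information(x, y):
    """Zero-order empirical mutual information of two equal-length sequences, in bits."""
    n = len(x)
    pxy, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    return sum((c / n) * math.log2(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def empirical_correlation(x, y):
    """Assumed zero-mean correlation factor of two real sequences."""
    sxy = sum(a * b for a, b in zip(x, y))
    return sxy / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))

# Discrete case: x = y = 0^(m-1) 1
m = 8
x = [0] * (m - 1) + [1]
i_before = empirical_mutual_information(x[:-1], x[:-1])   # first m - 1 symbols
i_after = empirical_mutual_information(x, x)              # all m symbols

# Continuous case: an uncorrelated history followed by one large matching sample
hx, hy = [1.0, -1.0, 1.0, -1.0], [1.0, 1.0, -1.0, -1.0]
rho_before = empirical_correlation(hx, hy)
rho_after = empirical_correlation(hx + [100.0], hy + [100.0])
```

In the discrete example the metric moves from 0 to $h_b(1/m)$ in a single symbol, while in the continuous example the correlation factor moves from 0 to nearly 1.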

Note that by the same argument that shows the loss from not utilizing the last symbol vanishes asymptotically, it is easy to show that feeding back the block success information only once every $1/\epsilon$ symbols, thereby decreasing the feedback rate to $\epsilon$, does not decrease the asymptotic rate, since this is equivalent to having $1/\epsilon$ unused symbols instead of one. Hence the scheme can be modified to operate with "zero rate" feedback. Similarly, the scheme can operate with a noisy feedback channel by introducing in the feedback link a delay suitable to convey the decoder decisions with a sufficiently low error rate over the noisy channel.

In addition to having rate losses, the scheme also has a minimal rate and a maximal rate for each block length. The minimal rate is $\frac{K}{n}$, resulting from sending a single block. If channel conditions are worse ($R_{emp} < \frac{K}{n}$), no information will be sent. A maximal rate exists since at best $K$ information units could be sent every 2 symbols (since for the continuous case $\mu^*_1 = \infty$ and for the discrete case $R^1_{emp}(x, y) = 0$, thus the decoding never terminates at the first symbol of the block), hence the maximum rate is $\frac{K}{2}$. As $n \to \infty$ we increase $K$ so that the minimum rate (and the rate offsets) tend to 0 and the maximum rate tends to $\infty$. The maximum rate is the reason that the scheme cannot approach the target rate $R_{emp}(x, y)$ uniformly in x, y in the continuous case, since for some pairs of sequences the target rate (which is unbounded) may be much higher than the maximum rate. The rate $R$ that we achieve in the proof of Theorem 4 is much smaller than the absolute maximum $\frac{K}{2}$. Note that successive schemes (such as Schalkwijk's [22]) do not suffer from the problem of maximum rate. For the discrete case the target rate is bounded by $\max(|\mathcal{X}|, |\mathcal{Y}|)$, therefore for sufficiently large $n$ the maximal rate $\frac{K}{2}$ exceeds $\max(|\mathcal{X}|, |\mathcal{Y}|)$ and we are able to show uniform convergence.

Although our target is the empirical mutual information over 
the n-block, an artifact of the partitioning to smaller blocks is 
that higher rates can be attained when the empirical conditional 
channel distribution varies over time, since by the convexity 
of mutual information with respect to the channel law the 
convex sum of mutual information over blocks exceeds the 
overall mutual information if these are not constant. 

We now turn to prove the achieved rate. The total amount of information sent (with or without error) is $B \cdot K$, therefore the actual rate is

$$R_{act} = \frac{BK}{n} \tag{41}$$



We now endeavor to show that this rate is close to or better than the empirical mutual information with probability of at least $1 - P_A$ over the sequences x, regardless of y and of whether a decoding error occurred.

The following definition of index sets in $\{1, \ldots, n\}$ is used for both the discrete and the continuous cases: $U_b = \{k\}_{k=k_b}^{k_{b+1}-2}$ denotes the channel uses of block $b$ except the last one, $L_0$ collects the last channel uses of all the blocks, $L_0 = \{k_b - 1 : b > 1\}$, and $U_{B+1}$ denotes the indices of the un-decoded (last) block, $U_{B+1} = \{k\}_{k=k_{B+1}}^{n}$ (including its last symbol), and is an empty set if the last block is decoded. The sets $\{U_b\}_{b=1}^{B+1}, L_0$ are disjoint and their union is $\{1, \ldots, n\}$. We denote the length of each block not including the last symbol by $m_b = |U_b|$. From this point on we split the discussion, and we start with the discrete case, which is simpler.

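1) Rate analysis for the discrete case: We write $\mu^*_m$ as

$$\mu^*_m = \frac{K + \Delta_m}{m} \le \frac{K + \Delta_\mu}{m}$$

with

$$\Delta_m = \log\Big(\frac{n}{P_e}\Big) + m\delta_m = \log\Big(\frac{n}{P_e}\Big) + |\mathcal{X}||\mathcal{Y}|\log(m+1) \le \log\Big(\frac{n}{P_e}\Big) + |\mathcal{X}||\mathcal{Y}|\log(n+1) \equiv \Delta_\mu \tag{42}$$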



From Lemma 5 and Eq. (35) we have that the following equation:

$$\hat{I}(x;y) - \Delta \le \sum_{b=1}^{B+1} \frac{m_b}{n} \hat{I}(x_{U_b}; y_{U_b}) + \frac{B}{n} \hat{I}(x_{L_0}; y_{L_0}) \tag{43}$$

is satisfied when x is outside a set $J_\Delta$ with probability of at most $\exp(-n(\Delta - \delta_n))$, where $\delta_n = (B+2)|\mathcal{X}| \frac{\log(n+1)}{n} + 2B_m \frac{\log(n)}{n}$. We shall find $B_m$ later on. To make sure the probability of $J$ is less than $P_A$ we require $\exp(-n(\Delta - \delta_n)) \le P_A$, therefore

$$\Delta \ge \delta_n - \frac{1}{n}\log(P_A) = (B+2)|\mathcal{X}| \frac{\log(n+1)}{n} + 2B_m \frac{\log(n)}{n} - \frac{1}{n}\log(P_A) \tag{44}$$

and we choose

$$\Delta = (3B_m + 2)|\mathcal{X}| \frac{\log(n+1)}{n} - \frac{1}{n}\log(P_A) \tag{45}$$



We now bound each element of Eq. (43). Consider block $b$ with $m_b + 1$ symbols. At the last symbol before decoding (symbol $m_b = |U_b|$) none of the codewords, including the correct one, crosses the threshold $\mu^*_m$, therefore:

$$\mu^*_{m_b} = \frac{K + \Delta_{m_b}}{m_b} > \hat{I}(x_{U_b}; y_{U_b}) \tag{46}$$



Specifically, for the unfinished block we have at symbol $n$:

$$\mu^*_{m_{B+1}} = \frac{K + \Delta_{m_{B+1}}}{m_{B+1}} > \hat{I}(x_{U_{B+1}}; y_{U_{B+1}}) \tag{47}$$

The way to understand these bounds is as a guarantee on the shortness of the blocks given sufficient mutual information. On the other hand, at the end of each block including the last symbol (symbols $k_b, \ldots, k_b + m_b$), since one of the sequences was decoded we have:

$$\frac{K + \Delta_{m_b+1}}{m_b + 1} = \mu^*_{m_b+1} \le \max_i \hat{I}\big((x_i)_{k_b}^{k_b+m_b}; y_{k_b}^{k_b+m_b}\big) \le \log\min(|\mathcal{X}|, |\mathcal{Y}|) \equiv h_0 \tag{48}$$



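which we can use to bound the number of blocks, since $m_b + 1 \ge \frac{K}{h_0}$, therefore

$$n \ge \sum_{b=1}^{B} (m_b + 1) \ge B \frac{K}{h_0} \;\Rightarrow\; B \le \frac{n h_0}{K} \equiv B_m \tag{49}$$

As for the unused last symbols we bound:

$$\hat{I}(x_{L_0}; y_{L_0}) \le h_0 \tag{50}$$

Combining Eq. (49) and Eq. (45) we have

$$\Delta \le \Big(\frac{3h_0}{K} + \frac{2}{n}\Big) |\mathcal{X}| \log(n+1) - \frac{1}{n}\log(P_A) \tag{51}$$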



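Combining Eqs. (46), (47), (50) with Eq. (43) and substituting $\Delta_m \le \Delta_\mu$ yields:

$$\hat{I}(x;y) \le \Delta + \sum_{b=1}^{B+1} \frac{m_b}{n} \cdot \frac{K + \Delta_{m_b}}{m_b} + \frac{B}{n} h_0 \le \Delta + \sum_{b=1}^{B+1} \frac{K + \Delta_\mu}{n} + \frac{B}{n} h_0 = \Delta + \frac{B+1}{n}(K + \Delta_\mu) + \frac{B}{n} h_0 \tag{52}$$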



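From Eq. (52), $B$ and consequently $R_{act}$ can be lower bounded:

$$R_{act} = \frac{B \cdot K}{n} \ge K \cdot \frac{\hat{I}(x;y) - \Delta - \frac{K + \Delta_\mu}{n}}{K + \Delta_\mu + h_0} = \frac{\hat{I}(x;y) - \Delta - \frac{K + \Delta_\mu}{n}}{1 + \frac{\Delta_\mu + h_0}{K}} \tag{53}$$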



Now if we increase $K$ with $n$ such that $O(\log(n)) < O(K) < O(n)$ (for example by choosing $K = n^\alpha$, $0 < \alpha < 1$), then $\frac{K}{n} \to 0$ as $n \to \infty$; since $\Delta_\mu = O(\log(n))$ we have $\frac{\Delta_\mu}{K} \to 0$, and from Eq. (51) we have $\Delta \to 0$. Thus for any $\epsilon$ we have $n$ large enough so that:

$$R_{act} \ge (\hat{I}(x;y) - \epsilon)(1 - \epsilon) \ge \hat{I}(x;y) - (1 + h_0)\epsilon \equiv R_{emp} \tag{54}$$

outside the set $J$, where the last inequality is due to the fact that $\hat{I}$ is bounded. Hence we proved our claim that the rate exceeds a rate function which converges uniformly to the empirical mutual information, and the proof of Theorem 3 is complete.

□
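To get a feeling for how quickly the overhead terms vanish, the following Python sketch evaluates the right-hand side of Eq. (53) for $K = n^\alpha$. It uses $\Delta_\mu$ as in Eq. (42) and takes Eq. (51) with equality; the function name, the default parameters ($\alpha = 0.5$, binary alphabets, $P_e = P_A = 10^{-3}$) and the use of nats are illustrative assumptions.

```python
import math

def discrete_rate_lower_bound(I_hat, n, alpha=0.5, Pe=1e-3, PA=1e-3, ax=2, ay=2):
    """Right-hand side of Eq. (53) in nats per channel use, for K = n**alpha."""
    K = n ** alpha                                    # information per rateless block
    h0 = math.log(min(ax, ay))                        # bound on the empirical mutual information
    delta_mu = math.log(n / Pe) + ax * ay * math.log(n + 1)                 # Eq. (42)
    Delta = (3 * h0 / K + 2 / n) * ax * math.log(n + 1) - math.log(PA) / n  # Eq. (51), equality
    return (I_hat - Delta - (K + delta_mu) / n) / (1 + (delta_mu + h0) / K)

I_hat = 0.5                                           # target empirical mutual information (nats)
bound_small = discrete_rate_lower_bound(I_hat, 10**6)
bound_large = discrete_rate_lower_bound(I_hat, 10**10)
```

With these settings the guaranteed rate is still noticeably below the target at $n = 10^6$ and within about one percent of it at $n = 10^{10}$, which illustrates that the convergence is asymptotic and can be slow.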

2) Rate analysis for the continuous case: The continuous case is more difficult for several reasons. One is that the error probability exponent has a missing degree of freedom ($\approx \exp(-(n-1)t)$). This results in a rate loss (through $s$ in the definition of $\mu^*_m$), which is larger for small blocks, and can be bounded only when assuming the number of blocks does not grow linearly with $n$. Since the effective mutual information $R_{emp}(x, y)$ is unbounded, we cannot simply bound the loss of mutual information over the unused symbols. Specifically, for a single symbol, $\hat{\rho} = 1$ and $R_{emp} = \infty$. Therefore we use the convexity of the correlation factor and the fact that it is bounded by 1. As a result, the loss introduced in order to attain convexity (over the rateless blocks) is in the correlation factor rather than the empirical mutual information. A loss in the correlation factor induces unbounded loss in the rate function for $\hat{\rho} \approx 1$, leading to a maximum rate. In order to cope with these difficulties we use a threshold $T$ on the number of symbols in a block ($T$ is chosen to grow slower than $n$), and treat large and small blocks differently: the large blocks are analyzed through their correlation factor, and for the small blocks the correlation factor is upper bounded by 1 and only the number of blocks is accounted for.

We denote by $\hat{\rho}_b = \hat{\rho}(x_{U_b}, y_{U_b})$ and $\hat{\rho} = \hat{\rho}(x, y)$ the correlation factor measured on a rateless block and on the entire transmission block, respectively. We denote by $B_S = \{b : m_b \le T\}$ and $B_L = \{b : m_b > T\}$ the indices of the small and the large blocks respectively (the last unfinished block included). The total number of symbols in the large blocks is denoted $m_L = \sum_{b \in B_L} m_b$. The number of large blocks is bounded by $|B_L| < \frac{n}{T}$.

The decoding threshold is written as

$$\mu^*_m = \frac{K}{m-1} + \frac{1}{m-1}\log\Big(\frac{n}{P_e}\Big) + \frac{\log(2)}{m-1} = \frac{K + \Delta_\mu}{m-1} \tag{55}$$

where we denoted $\Delta_\mu = \log\big(\frac{2n}{P_e}\big)$.



We consider the partitioning of the index set $\{1, \ldots, n\}$ into at most $p = \frac{n}{T}$ sets: the first $\frac{n}{T} - 1$ (or fewer) sets are the large blocks except their last symbol, $\{U_b\}_{b \in B_L}$ (each with at least $T + 1$ symbols by definition), and the last set, denoted $L_1$, includes the rest of the symbols (last symbols of these blocks and all symbols of small blocks), and has $|L_1| = n - m_L$. Since this partitioning has a bounded number of sets, by applying Lemma 6 and Eq. (35) with $p_{\max} = p = \frac{n}{T}$ we have that Eq. (57) below is satisfied when x is outside a set $J$ with probability at most:

$$\Pr(J) \le n^{2p} \cdot 2^p e^{-n\Delta^2/8} = \exp\Big(-n\Big(\log(e)\frac{\Delta^2}{8} - \frac{2}{T}\log(\sqrt{2}\,n)\Big)\Big) \tag{56}$$



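for any $\Delta$ in the range allowed by Lemma 6. This bound tends to 0 if $T > O(\log(n))$ (since $\log(e)\frac{\Delta^2}{8} - \frac{2}{T}\log(\sqrt{2}\,n) \to \log(e)\frac{\Delta^2}{8} > 0$), therefore for any such $\Delta$ there is $n$ large enough such that this probability falls below the required $P_A$. The convexity condition is:

$$\hat{\rho}^2 - \Delta \le \sum_{b \in B_L} \frac{m_b}{n}\hat{\rho}_b^2 + \frac{n - m_L}{n}\hat{\rho}(x_{L_1}; y_{L_1})^2 \le \sum_{b \in B_L} \frac{m_b}{n}\hat{\rho}_b^2 + \frac{n - m_L}{n} \tag{57}$$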



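where $\Delta$ can be made arbitrarily close to 0. We define a factor $\eta_1 < 1$ and apply the function $(-\frac{1}{2})\log(1 - \eta_1 t)$ to both sides of the above equation. Since the function is monotonically increasing and convex ∪ over $t \in [0, 1)$ (stemming from the concavity ∩ of $\log(t)$), we have:

$$r_0 \equiv \Big(-\frac{1}{2}\Big)\log\big(1 - \eta_1 (\hat{\rho}^2 - \Delta)\big) \le \Big(-\frac{1}{2}\Big)\log\Big(1 - \eta_1\Big[\sum_{b\in B_L}\frac{m_b}{n}\hat{\rho}_b^2 + \frac{n-m_L}{n}\cdot 1\Big]\Big) \le \sum_{b\in B_L}\frac{m_b}{n}\Big(-\frac{1}{2}\Big)\log(1-\eta_1\hat{\rho}_b^2) + \frac{n-m_L}{n}\Big(-\frac{1}{2}\Big)\log(1-\eta_1\cdot 1) \tag{58}$$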



We start by bounding the terms related to the large blocks. At the last symbol before decoding in each block (or symbol $n$ for the unfinished block) none of the codewords, including the correct one, crosses the threshold $\mu^*_m$, therefore we have for $b = 1, \ldots, B+1$:

$$\mu^*_{m_b} = \frac{K + \Delta_\mu}{m_b - 1} \ge R_{emp}(x_{U_b}, y_{U_b}) = -\frac{1}{2}\log(1 - \hat{\rho}_b^2) \tag{59}$$

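and since $m_b \ge T + 1$:

$$\frac{m_b}{n}\Big(-\frac{1}{2}\Big)\log(1-\eta_1\hat{\rho}_b^2) \le \frac{m_b}{n}\Big(-\frac{1}{2}\Big)\log(1-\hat{\rho}_b^2) \le \frac{m_b}{n}\cdot\frac{K+\Delta_\mu}{m_b-1} = \Big(1+\frac{1}{m_b-1}\Big)\frac{K+\Delta_\mu}{n} \le \Big(1+\frac{1}{T}\Big)\frac{K+\Delta_\mu}{n} \tag{60}$$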



For the small blocks we use $n \le \sum_{b \in B_L}(m_b + 1) + \sum_{b \in B_S}(m_b + 1) \le m_L + |B_L| + (T+1)|B_S|$ (where the inequality is since the unterminated block has length $m_b$) to bound $n - m_L \le |B_L| + (T+1)|B_S|$.



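Combining Eq. (58) with these bounds we have:

$$r_0 \le |B_L|\Big(1+\frac{1}{T}\Big)\frac{K+\Delta_\mu}{n} + \frac{|B_L| + (T+1)|B_S|}{n}\Big(-\frac{1}{2}\Big)\log(1-\eta_1) \tag{61}$$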



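The last equation is a lower bound on a linear combination of $|B_L|$ and $|B_S|$. Since the total information sent depends on $|B_L| + |B_S|$, we equalize the coefficients multiplying $|B_L|$ and $|B_S|$ by determining $\eta_1$ so that:

$$\Big(-\frac{1}{2}\Big)\log(1-\eta_1) = \frac{1}{T}\Big(1+\frac{1}{T}\Big)(K+\Delta_\mu) \tag{62}$$

This is always possible since the RHS is positive and the LHS maps $\eta_1 \in (0, 1)$ to $(0, \infty)$. Then

$$r_0 \le |B_L|\Big(1+\frac{1}{T}\Big)\frac{K+\Delta_\mu}{n} + \frac{|B_L| + (T+1)|B_S|}{n}\cdot\frac{1}{T}\Big(1+\frac{1}{T}\Big)(K+\Delta_\mu) = \big(|B_L|+|B_S|\big)\Big(1+\frac{1}{T}\Big)^2\frac{K+\Delta_\mu}{n} \le (B+1)\Big(1+\frac{1}{T}\Big)^2\frac{K+\Delta_\mu}{n} \tag{63}$$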
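The equalization step can be verified numerically. The sketch below assumes that Eq. (62) reads $(-\frac{1}{2})\log(1-\eta_1) = \frac{1}{T}(1+\frac{1}{T})(K+\Delta_\mu)$, which is the choice that makes the coefficients of $|B_L|$ and $|B_S|$ in Eq. (61) equal; the function name and the numeric values are hypothetical.

```python
import math

def equalization_check(BL, BS, T, K, delta_mu, n):
    """Compare Eq. (61), under the assumed choice of eta_1, with the closed form of Eq. (63)."""
    half_log = (1.0 / T) * (1.0 + 1.0 / T) * (K + delta_mu)   # (-1/2) log(1 - eta_1)
    eta1 = 1.0 - math.exp(-2.0 * half_log)
    rhs61 = (BL * (1.0 + 1.0 / T) * (K + delta_mu) / n
             + (BL + (T + 1) * BS) / n * half_log)
    rhs63 = (BL + BS) * (1.0 + 1.0 / T) ** 2 * (K + delta_mu) / n
    return eta1, rhs61, rhs63

eta1, rhs61, rhs63 = equalization_check(BL=3, BS=40, T=50, K=200.0, delta_mu=30.0, n=10_000)
```

Both expressions agree for any split between large and small blocks, so the bound depends only on the total number of blocks.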
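Extracting a lower bound on $B$ from Eq. (63) yields a bound on the empirical rate:

$$R_{act} = \frac{K}{n}B \ge \frac{K}{n}\Big(\frac{r_0 \cdot n}{(K+\Delta_\mu)(1+\frac{1}{T})^2} - 1\Big) = \frac{r_0}{(1+K^{-1}\Delta_\mu)(1+\frac{1}{T})^2} - \frac{K}{n} = \frac{(-\frac{1}{2})\log\big(1-\eta_1(\hat{\rho}^2-\Delta)\big)}{(1+K^{-1}\Delta_\mu)(1+\frac{1}{T})^2} - \frac{K}{n} \equiv R_{LB1} \tag{64}$$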



Equation (64) may be optimized with respect to $T$ to obtain a tighter bound, but this is not necessary to prove the theorem. Recall that $\Delta_\mu = O(\log(n))$. By choosing $O(\log(n)) < K < O(n)$ the factor $(1 + K^{-1}\Delta_\mu)$ in Eq. (64) can be made arbitrarily close to 1 and $\frac{K}{n}$ can be made arbitrarily close to 0. As we saw above, choosing $O(\log(n)) < T < O(n)$ enables us to have $P_A \to 0$ with $\Delta$ arbitrarily close to 0, and finally if $K > O(T)$ then the RHS of Eq. (62) tends to $\infty$ and therefore we can choose $\eta_1$ arbitrarily close to 1. Summarizing the above, by selecting $O(\log(n)) < O(T) < O(K) < O(n)$ we can write the rate as

$$R_{act} \ge R_{LB1} = \Big(-\frac{1}{2}\Big)\log\big(1-\eta_1 \cdot (\hat{\rho}^2-\Delta)\big)\cdot\eta_2 - \epsilon_1 \tag{65}$$



with $\eta_1, \eta_2 \to 1$ and $\epsilon_1, \Delta \to 0$. $R_{LB1}$ tends to the target rate $R_2(\hat{\rho}) = \frac{1}{2}\log\big(\frac{1}{1-\hat{\rho}^2}\big)$ for each point $\hat{\rho} \in [0, 1)$ (but not uniformly), and it remains to show that for any $R$, $\epsilon$ there is $n$ large enough such that $R_{LB1} \ge R_{LB2} = \min(R_2(\hat{\rho}) - \epsilon, R)$.

The functions $R_2(\rho)$ and $R_{LB1}(\rho)$ are monotonically increasing (for fixed $\eta_1, \eta_2$ and $\epsilon_1$), and it is easy to verify by differentiation that the difference $R_2(\rho) - R_{LB1}(\rho)$ is also monotonically increasing. Given $R$, $\epsilon$, choose $\rho_0$ such that $R_2(\rho_0) = R + \epsilon$. Since $R_{LB1}(\rho_0) \to R_2(\rho_0)$ as $n \to \infty$, for $n$ large enough we have $R_2(\rho_0) - R_{LB1}(\rho_0) \le \epsilon$, and therefore $R_{LB1}(\rho_0) \ge R_2(\rho_0) - \epsilon = R$. For this $n$, for any $\rho \le \rho_0$ we have from the monotonicity of the difference that $R_2(\rho) - R_{LB1}(\rho) \le \epsilon$, and for any $\rho \ge \rho_0$ we have from the monotonicity of $R_{LB1}(\rho)$ that $R_{LB1}(\rho) \ge R$, therefore $R_{LB1} \ge R_{LB2}$, which completes the proof of Theorem 4. □

VII. Examples 

In this section we give some examples to illustrate the model developed in this paper, using a slightly less formal notation.

A. Constant outputs and other illustrative cases

The statement that a rate which is determined by the input and output sequences can be attained without assuming any dependence between them may seem paradoxical at first. Some insight can be gained by looking at the specific case where the output sequence is fixed and does not depend on the input. In this case, obviously, no information can be transferred. Since the encoder uses random sequences, the result of fixing the output is that the probability of having an empirical mutual information larger than $\epsilon > 0$ tends to 0, therefore most of the time the rate will be 0. Infrequently, however, the input sequence accidentally has empirical mutual information larger than $\epsilon > 0$ with the output sequence. In this case the decoder will set a positive rate, but will very likely fail to decode. These cases occur with vanishing probability and constitute part of the error probability. So in this case we will transmit rate $R = 0$ with probability of at least $1 - P_e$ and $R > 0$ with probability at most $P_e$. Conversely, if the channel appears to be good according to the input and output sequences (suppose for example $y_k = x_k$), the decoder does not know whether it is facing a good channel or just a coincidence; however, it takes a small risk by assuming it is indeed a good channel and attempting to decode, since the chances of high mutual information appearing accidentally are small (and uniformly bounded for all output sequences).

Another point that appears paradoxical at first sight is that the decoder is able to determine a rate $R \ge R_{emp}$ without knowing x, for any $x \notin J$. First observe that although it is an output of the decoder, the rate $R$ is not controlled by the encoder and therefore cannot convey information. Since the decoder knows the codebook, and given the codebook the sequence x is limited to a number of possibilities (determined by the possible messages and block locations), it is easy to find an $R(y) \ge R_{emp}(x, y)$ by maximizing $R_{emp}$ over all possible sequences x. Vaguely speaking, the decoding process is indeed a maximization of $R_{emp}$ over multiple x sequences, and by Lemmas 1 and 4 such a decoding process guarantees a small probability of error.

B. Applying the continuous alphabet scheme to other input 
alphabets 

The scheme used for the continuous case can be adapted to peak limited or even discrete input by using an adaptation function, i.e. the channel input will be $x'_k = f(x_k)$. In this case the modified codebook $C' = f(C)$ will be generated by passing the Gaussian codebook through the adaptation function, but for analysis purposes the adaptation function $f(\cdot)$ may be considered part of the channel, and the correlation factor is calculated with respect to x, which is used to generate the codebook. In order to write the rate guaranteed by this approach as a function of $x'$ rather than x, the law of large numbers has to be utilized (in general) with respect to the distribution $\Pr(x_k | x'_k)$.

C. Non linear channels 

In analyzing probabilistic channels, the correlation model determines that the rate $\frac{1}{2}\log\big(\frac{1}{1-\rho^2}\big)$ is always achievable using a Gaussian code (no randomization is needed if the channel is probabilistic, as can be shown by the standard argument about the existence of a good code). This is actually a result of Lemma 2.

This expression is useful for analyzing channels in which the noise is not additive or in which non linearities exist. As an example, transmitter noise is usually modeled as an additive noise. However, a large part of this noise is due to distortion (e.g. in the power amplifier), and therefore depends on the transmitted signal and is inversely correlated to it. Consider the non linear channel $Y = f(X) + V$ with $V \sim \mathcal{N}(0, N)$. In this case, if we define the effective SNR as $SNR_{eff} = \frac{\rho^2}{1-\rho^2}$, then the rate $R = \frac{1}{2}\log(1 + SNR_{eff})$ is achievable. The correlation factor is:

$$\rho^2 = \frac{E(XY)^2}{E(X^2)E(Y^2)} = \frac{E(Xf(X))^2}{E(X^2)\big(E(f(X)^2)+N\big)} \tag{66}$$

Therefore the effective SNR is:



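$$SNR_{eff} = \frac{\rho^2}{1-\rho^2} = \frac{E(Xf(X))^2}{E(X^2)\big(E(f(X)^2)+N\big) - E(Xf(X))^2} = \frac{P_{eff}}{N + N_{eff}} \tag{67}$$

where we defined the effective gain $\gamma$, the effective power $P_{eff}$ and the effective noise $N_{eff}$ as:

$$\gamma = \frac{E(Xf(X))}{E(X^2)} \tag{68}$$

$$P_{eff} = \frac{(E[Xf(X)])^2}{E(X^2)} = E\big[(\gamma X)^2\big] \tag{69}$$

$$N_{eff} = E\big[(f(X) - \gamma X)^2\big] = E(f(X)^2) - \frac{(E[Xf(X)])^2}{E(X^2)} \tag{70}$$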



This yields a simple characterization of the degradation caused by the non linearity, which is independent of the noise power and is tight if the non linearity is small. This model makes it possible to characterize the transmitter distortions by the two parameters $P_{eff}, N_{eff}$, a characterization which is more convenient and practical to calculate than the channel capacity, and which on the other hand guarantees that transmitter noise evaluated this way never degrades the channel capacity by more than determined by Eq. (67).
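As an illustration of Eqs. (66)-(70), the following Python sketch computes the effective gain, power and noise for a hypothetical soft-clipping non linearity $f(x) = \tanh(x)$ with Gaussian input. The expectations are evaluated by a simple trapezoidal rule; the choice of $f$, the values of $P$ and $N$, and all function names are assumptions made for the example.

```python
import math

def gaussian_expectation(g, P, steps=20001, width=10.0):
    """E[g(X)] for X ~ N(0, P), trapezoidal rule on [-width*sigma, width*sigma]."""
    sigma = math.sqrt(P)
    a, b = -width * sigma, width * sigma
    h = (b - a) / (steps - 1)
    total = 0.0
    for i in range(steps):
        x = a + i * h
        w = 0.5 if i in (0, steps - 1) else 1.0
        total += w * g(x) * math.exp(-x * x / (2 * P)) / math.sqrt(2 * math.pi * P)
    return total * h

def effective_parameters(f, P, N):
    """Return (gamma, P_eff, N_eff, SNR_eff) for Y = f(X) + V, V ~ N(0, N)."""
    Ex2 = gaussian_expectation(lambda x: x * x, P)
    Exf = gaussian_expectation(lambda x: x * f(x), P)
    Ef2 = gaussian_expectation(lambda x: f(x) ** 2, P)
    gamma = Exf / Ex2                        # Eq. (68)
    P_eff = Exf ** 2 / Ex2                   # Eq. (69)
    N_eff = Ef2 - P_eff                      # Eq. (70)
    rho2 = Exf ** 2 / (Ex2 * (Ef2 + N))      # Eq. (66)
    return gamma, P_eff, N_eff, rho2 / (1 - rho2)

gamma, P_eff, N_eff, snr_eff = effective_parameters(math.tanh, P=1.0, N=0.1)
rate_bits = 0.5 * math.log2(1 + snr_eff)
```

For a linear $f$ the same function returns $\gamma = 1$ and $N_{eff} = 0$, so the effective SNR reduces to $P/N$.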

Another interesting application of this bound is in treating 
receiver estimation errors, since it is simpler to calculate the 
loss in the correlation factor induced due to the imperfect 
knowledge of the channel parameters than the loss in capacity. 
For example, the bound in [16] for the loss due to channel
estimation from training, when specialized to single input 
single output (SISO) channels, may be computed using the 
correlation factor bound. 

D. Employing continuous channel scheme over a BSC 

Fig. 5. Comparison of C, R for the BSC

When operated over a channel different from the Gaussian additive noise channel, the rates achieved with the scheme we described in the continuous case are suboptimal compared to the channel capacity. The loss depends on the channel in question. As an example, suppose the communication system is used over a BSC with error probability $\epsilon$, i.e. the continuous input value $X$ is translated to a binary value by $\mathrm{sign}(X)$, and the output is $Y = \mathrm{sign}(X) \cdot (-1)^{\mathrm{Ber}(\epsilon)}$. The capacity of this channel is $C = 1\,\text{bit} - h_b(\epsilon)$, and we are interested in calculating the rate which would be achieved by our scheme (which does not know the channel) for this channel behavior. For this channel with Gaussian $\mathcal{N}(0, P)$ input we have (through a simple calculation):



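$$E(XY) = (1 - 2\epsilon) E|X| = (1 - 2\epsilon)\sqrt{\frac{2P}{\pi}} \tag{71}$$

Hence

$$\rho^2 = \frac{E(XY)^2}{P \cdot E(Y^2)} = \frac{2}{\pi}(1 - 2\epsilon)^2 \tag{72}$$

And

$$R = \frac{1}{2}\log\Big(\frac{1}{1 - \frac{2}{\pi}(1 - 2\epsilon)^2}\Big) \tag{73}$$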



The comparison between $C$ and $R$ is presented in Fig. 5. It can be shown that $R \ge \frac{2}{\pi} C$, thus the maximum loss is 36%.
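The comparison can be reproduced numerically. The Python sketch below evaluates the BSC capacity and the rate $R = \frac{1}{2}\log_2\big(1/(1 - \frac{2}{\pi}(1-2\epsilon)^2)\big)$ on a grid of crossover probabilities and records the smallest ratio $R/C$; the grid and the function names are illustrative.

```python
import math

def binary_entropy(p):
    """Binary entropy h_b(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(eps):
    return 1.0 - binary_entropy(eps)

def scheme_rate(eps):
    """Rate of the continuous-alphabet scheme over the BSC, in bits."""
    rho2 = (2.0 / math.pi) * (1.0 - 2.0 * eps) ** 2
    return 0.5 * math.log2(1.0 / (1.0 - rho2))

grid = [i / 100 for i in range(50)]            # eps = 0.00, 0.01, ..., 0.49
worst_ratio = min(scheme_rate(e) / bsc_capacity(e) for e in grid)
best_ratio = scheme_rate(0.0) / bsc_capacity(0.0)
```

On this grid the ratio stays above $2/\pi \approx 0.637$ and approaches it as $\epsilon \to \frac{1}{2}$, consistent with a maximum loss of about 36%.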



E. Channels that fail the zero order and the correlation model 

Although we did not assume anything about the channel, and specifically we did not assume the channel is memoryless, the fact that we used the zero-order empirical distribution means the results are less tight for channels with memory. Specifically, if delay is introduced then the scheme would fail completely. For example, for the channel $y_k = x_k + \frac{1}{2}x_{k-1} + v_k$ we would obtain positive rates and the intersymbol interference (ISI) $\frac{1}{2}x_{k-1}$ would be treated (suboptimally) as noise, but for the error-free channel $y_k = x_{k-1}$ the achieved rate would be 0 (with high probability). Similarly, we can find a memoryless channel with infinite capacity but for which the correlation model we used for the continuous alphabet scheme fails: if $y_k = x_k^2$ then $\rho = 0$. Another example of practical importance is the fading channel (with memory) $y_n = h_n x_n + v_n$, where $h_n$ is slowly fading with mean 0. All these examples result from the simplicity of the models used, and can be solved by schemes employing higher order empirical distributions (over blocks, or by using Markov models), and by employing tighter approximations of the empirical statistics (e.g. by higher order statistics) in the continuous case.
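A small simulation illustrates these failure modes. The sketch below draws a Gaussian input and measures the empirical correlation factor (assumed here to be the zero-mean form $\sum x_k y_k / \sqrt{\sum x_k^2 \sum y_k^2}$) for an ISI channel, a pure delay, and a squaring channel. The noise level, the seed and the helper names are arbitrary choices for the example.

```python
import math
import random

def correlation_factor(x, y):
    """Assumed zero-mean empirical correlation factor."""
    sxy = sum(a * b for a, b in zip(x, y))
    return sxy / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))

def rate_bits(rho):
    """Rate function (1/2) log2(1 / (1 - rho^2))."""
    return 0.5 * math.log2(1.0 / (1.0 - rho * rho))

rng = random.Random(0)
n = 50000
x = [rng.gauss(0, 1) for _ in range(n)]
v = [rng.gauss(0, 0.5) for _ in range(n)]       # additive noise, variance 0.25
x_prev = [0.0] + x[:-1]                          # x delayed by one symbol

rho_isi = correlation_factor(x, [a + 0.5 * p + w for a, p, w in zip(x, x_prev, v)])
rho_delay = correlation_factor(x, x_prev)        # y_k = x_{k-1}
rho_square = correlation_factor(x, [a * a for a in x])   # y_k = x_k^2
```

The ISI channel keeps a correlation factor of roughly $1/\sqrt{1.5} \approx 0.82$ and hence a positive rate, while the delay and squaring channels give a correlation factor near 0 even though they are noiseless.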

F. Using individual channel model to analyze adversarial 
individual sequence 

As we noted in the overview, the results obtained for 
the individual channel model constitute a convenient starting 
point for analyzing channel models which have a full or 
partial probabilistic behavior. It is clear that results regarding 
achievable rates in fully probabilistic, compound, arbitrarily 
varying and individual noise sequence models can be obtained 
by applying the weak law of large numbers to the theorems 
discussed here (limited, in general, to the randomized encoders 
regime). 

E.g. for a compound channel model $W_\theta(y|x)$ with an unknown parameter $\theta$, since $\hat{P}(x, y) \to P_\theta(x, y) = W_\theta(y|x)Q(x)$ in probability for every $\theta$ and since $I(\cdot;\cdot)$ is continuous, $\hat{I}(x; y) \to I_\theta(X; Y)$. Hence from Theorem 1 rate $\min_\theta I_\theta(X; Y)$ can be obtained without feedback, and from Theorem 3 rate $I_\theta(X; Y)$ can be obtained with feedback.
These results are not new (see |23"ll24l for the first and the 
second is obtained as a special case of the results in Q and 13 
since the individual noise sequence model can be degenerated 
into a compound model) and are given only to show the ease 
of using the individual model once established. 

To show the strength of the model we analyze a problem considered also in [2], of an individual sequence which is determined by an adversary and allowed to depend in a fixed or randomized way on the past channel inputs and outputs. For simplicity we start with the binary channel $y_k = x_k \oplus e_k$, where $e_k$ is allowed to depend on $x^{k-1}$ and $y^{k-1}$ (possibly in a random fashion), and the target is to show the empirical capacity is still achievable in this scenario. Note that here $e_k$ is a random variable but is not assumed to be i.i.d. We denote the relative number of errors by $\hat{\epsilon} = \frac{1}{n}\sum_{k=1}^{n} e_k$. We would like to show that the communication scheme achieves a rate close to $1\,\text{bit} - h_b(\hat{\epsilon})$ with high probability, regardless of the adversary's policy. Note that both the achieved rate and the target $1\,\text{bit} - h_b(\hat{\epsilon})$ are random variables, and the claim is that they are close with high probability (i.e. that the difference converges in probability to 0 when $n \to \infty$).

Applying the scheme achieving Theorem 3 with $Q = \mathrm{Ber}(\frac{1}{2})$ we can asymptotically approach (or exceed) the rate:

$$\hat{I}(x;y) = \hat{H}(y) - \hat{H}(y|x) = \hat{H}(y) - \hat{H}(e|x) \ge \hat{H}(y) - \hat{H}(e) = \hat{H}(y) - h_b(\hat{\epsilon}) \tag{74}$$

Note that unlike in the probabilistic BSC where we have $I(X;Y) = H(Y) - H(E)$, here the empirical distribution of e is not necessarily independent of x, therefore the entropies are only related by the inequality $\hat{H}(e|x) \le \hat{H}(e)$ (conditioning reduces entropy). In order to show that a rate of $1\,\text{bit} - h_b(\hat{\epsilon})$ is achieved, we only need to show $\hat{H}(y) \to 1\,\text{bit}$ in probability. Since $X_k$ is independent of $X^{k-1}, Y^{k-1}$ and therefore also of $E_k$, we have:

$$\Pr(Y_k = 0 | Y^{k-1}) = \sum_{e_k}\Pr(Y_k=0|Y^{k-1},e_k)\Pr(e_k|Y^{k-1}) = \sum_{e_k}\Pr(X_k=e_k|Y^{k-1},e_k)\Pr(e_k|Y^{k-1}) = \sum_{e_k}\Pr(X_k=e_k)\Pr(e_k|Y^{k-1}) = \sum_{e_k}\frac{1}{2}\Pr(e_k|Y^{k-1}) = \frac{1}{2} \tag{75}$$

Therefore $Y_1^n$ is distributed i.i.d. $\mathrm{Ber}(\frac{1}{2})$, and from the law of
large numbers and the continuity of $H(\cdot)$ we have the desired
result. This result is a special case of the results in 0. 
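The argument of Eq. (75) can be illustrated by simulation. In the Python sketch below a hypothetical adversary chooses the flip probability from the previous channel output, while the input is drawn i.i.d. $\mathrm{Ber}(\frac{1}{2})$; the adversary policy, the seed and the function names are assumptions made for the example.

```python
import math
import random

def binary_entropy(p):
    """Binary entropy h_b(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def simulate_adversarial_binary_channel(n, seed=0):
    """Return (fraction of ones in y, relative number of errors)."""
    rng = random.Random(seed)
    ones, errors, prev_y = 0, 0, 0
    for _ in range(n):
        # Hypothetical adversary: flips more often after seeing a 1 at the output.
        flip_prob = 0.4 if prev_y == 1 else 0.05
        e = 1 if rng.random() < flip_prob else 0
        x = rng.randrange(2)            # X_k ~ Ber(1/2), independent of the past
        y = x ^ e
        ones += y
        errors += e
        prev_y = y
    return ones / n, errors / n

frac_ones, eps_hat = simulate_adversarial_binary_channel(20000)
h_y = binary_entropy(frac_ones)         # empirical output entropy, close to 1 bit
target_rate = 1.0 - binary_entropy(eps_hat)
```

Although the error sequence depends on the past outputs, the output remains close to uniform, so the empirical output entropy is close to 1 bit.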

We can extend the example above to general discrete channels
and consolidate the adversarial sequence
model considered in [2] (for modulo-additive channels) with
the general discrete channel with fixed sequence considered
in 0. We address the channel W s (y\x) with state sequence 
$s_k$ potentially determined by an adversary knowing all past inputs and outputs. We would like to show that the rate $I\big(Q, \sum_s W_s(y|x)\hat{P}_s(s)\big)$ (the mutual information of the state-averaged channel) can be asymptotically attained in the sense defined above.

This result is a superset of the results of and 0. It 
overlaps with in the case s is a fixed sequence and with 
[2] for the case of a modulo-additive channel (or when the target
rate is based on the modulo-additive model).

Since Theorem 3 shows that the rate $\hat{I}(x;y) = I(\hat{P}(x), \hat{P}(y|x))$ can be approached or exceeded asymptotically, it remains to show that the empirical distribution $\hat{P}(x,y)$ is asymptotically close to the state-averaged distribution $P_{avg}(x,y) = \sum_s W_s(y|x)\hat{P}_s(s)Q(x) = \frac{1}{n}\sum_k W_{s_k}(y|x)Q(x)$, and the result will follow from the continuity of the mutual information. Note that the latter value is a random variable (function) depending on the behavior of the adversary. Here we do not use the law of large numbers because of the interdependencies between the signals x, y and s.

Our purpose is to prove that the difference $\Delta(t,r)$ defined below converges in probability to 0 for every $t, r$:

$$\Delta(t,r) = \hat{P}_{(x,y)}(t,r) - P_{avg}(t,r) = \frac{1}{n}\sum_{k=1}^{n} \mathrm{Ind}(X_k=t, Y_k=r) - \frac{1}{n}\sum_{k=1}^{n} W_{S_k}(r|t)Q(t) = \frac{1}{n}\sum_{k=1}^{n} \varphi_k(t,r) \tag{76}$$

where $\varphi_k(t,r) = \mathrm{Ind}(X_k=t, Y_k=r) - W_{S_k}(r|t)Q(t)$. For brevity of notation we omit the argument $(t,r)$ from $\varphi_k(t,r)$, since from this point on it takes a fixed value. Then

$$E\big(\mathrm{Ind}(X_k=t, Y_k=r) \,\big|\, X^{k-1}, Y^{k-1}, S^k\big) = \Pr(X_k=t, Y_k=r | X^{k-1}, Y^{k-1}, S^k) = \Pr(X_k=t | X^{k-1}, Y^{k-1}, S^k) \cdot \Pr(Y_k=r | X_k=t, X^{k-1}, Y^{k-1}, S^k) \overset{(a)}{=} \Pr(X_k=t) \cdot \Pr(Y_k=r | X_k=t, S_k) \overset{(b)}{=} Q(t) W_{S_k}(r|t) \tag{77}$$



where (a) is due to the independent drawing of $X_k$ (when not conditioned on the codebook), the fact that $S_k$ is independent of $X_k$, and the memoryless channel (defining the Markov chain $(X^{k-1}, Y^{k-1}, S^{k-1}) \leftrightarrow (X_k, S_k) \leftrightarrow Y_k$), and (b) is due to the i.i.d. drawing of $X_k$ from $Q$ and the definition of $W$. From Eq. (77) we have that:

$$E(\varphi_k | X^{k-1}, Y^{k-1}, S^k) = 0 \tag{78}$$



By the smoothing theorem we also have that $\varphi_k$ has zero mean, $E(\varphi_k) = 0$. We now show that the $\varphi_k$ are uncorrelated. Consider two different indices $j < k$ (without loss of generality), then

$$E(\varphi_k \varphi_j) = E\big[E(\varphi_k \varphi_j | X^{k-1}, Y^{k-1}, S^k)\big] = E\big[\varphi_j E(\varphi_k | X^{k-1}, Y^{k-1}, S^k)\big] = 0 \tag{79}$$

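where we used the smoothing theorem and the fact that $\varphi_j$ is completely determined by $X_j, Y_j, S_j$, which are given. In addition, since by definition $-1 \le \varphi_k \le 1$, $E(\varphi_k^2) \le 1$. Therefore

$$E(\Delta^2) = \frac{1}{n^2}\sum_{j,k=1}^{n} E(\varphi_j \varphi_k) = \frac{1}{n^2}\sum_{k=1}^{n} E(\varphi_k^2) \le \frac{1}{n} \tag{80}$$

and by the Chebyshev inequality for any $\epsilon > 0$:

$$\Pr(|\Delta(t,r)| \ge \epsilon) \le \frac{E(\Delta^2)}{\epsilon^2} \le \frac{1}{n\epsilon^2} \xrightarrow[n \to \infty]{} 0 \tag{81}$$

which proves the claim. □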
This result is new, to our knowledge; however, the main point here is the relative simplicity with which it is attained when relying on the empirical channel model (note that most of the proof did not require any information-theoretic argument).
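The concentration of $\Delta(t,r)$ can also be observed empirically. The following Python sketch simulates a hypothetical two-state binary channel in which the adversary selects the state from the previous output, and computes $\Delta(t,r)$ for all four pairs; the channel parameters, the adversary policy and the function name are assumptions made for the example.

```python
import random

def simulate_delta(n, seed=0):
    """Return {(t, r): empirical joint probability minus state-averaged probability}."""
    rng = random.Random(seed)
    crossover = {0: 0.1, 1: 0.4}        # hypothetical W_s: BSC crossover per state
    Q = {0: 0.5, 1: 0.5}                # input prior
    pairs = [(t, r) for t in (0, 1) for r in (0, 1)]
    counts = {tr: 0 for tr in pairs}
    p_avg = {tr: 0.0 for tr in pairs}
    prev_y = 0
    for _ in range(n):
        s = prev_y                      # adversary picks the state from the past output
        x = rng.randrange(2)            # X_k drawn i.i.d. from Q
        y = x ^ (1 if rng.random() < crossover[s] else 0)
        counts[(x, y)] += 1
        for (t, r) in pairs:
            w = crossover[s] if r != t else 1.0 - crossover[s]
            p_avg[(t, r)] += w * Q[t] / n
        prev_y = y
    return {tr: counts[tr] / n - p_avg[tr] for tr in pairs}

delta = simulate_delta(40000)
max_abs_delta = max(abs(d) for d in delta.values())
```

For $n = 40000$ the Chebyshev bound of Eq. (81) gives $\Pr(|\Delta| \ge 0.02) \le 1/16$; in practice the deviations are much smaller than this bound suggests.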

VIII. Comments and further study 
A. Limitations of the model 

The scheme presented here is suboptimal when operated over channels with memory or, in the continuous case, over non-AWGN channels, and in section VII-E we discussed several cases where the communication fails completely. Obviously the solution is to extend the time order of the model. A simple extension is by using the super-alphabets $\mathcal{X}^p$ and $\mathcal{Y}^p$ and treating a block of channel uses as one symbol. A more delicate extension is by considering a Markov model (the $p$-th order empirical conditional probability $\hat{P}(x_k, y_k | x_{k-p}^{k-1}, y_{k-p}^{k-1})$).

For the continuous channel we focused on a specific class 
of continuous channels where the alphabet is the real numbers 
(we have not considered vectors as in MIMO channels), and 
we did not achieve the full mutual information. A possible 
extension is to find measures of empirical mutual information 
for the continuous channels which are also attainable and 
approach the probabilistic mutual information for probabilistic 
channels. The current paper exhibits a considerable similarity 
between the continuous case and the discrete case which is not 
fully explored here, and a unifying theory which will include 
the two as particular cases is wanting. 

We conjecture that the following definition of empirical
mutual information may achieve these goals: given a family
of joint distributions (not necessarily i.i.d.) $\{P_\theta(\mathbf{x},\mathbf{y}), \theta \in \Theta\}$,
define the entropy with respect to the family $\Theta$ as the entropy
of the closest member of the family (in the maximum likelihood
sense): $H_\Theta(\mathbf{x}) = \min_{\theta \in \Theta} -\frac{1}{n} \log P_\theta(\mathbf{x})$, and likewise
$H_\Theta(\mathbf{x}|\mathbf{y}) = \min_{\theta \in \Theta} -\frac{1}{n} \log P_\theta(\mathbf{x}|\mathbf{y})$, and define the relative
mutual information as $I_\Theta(\mathbf{x};\mathbf{y}) = H_\Theta(\mathbf{x}) - H_\Theta(\mathbf{x}|\mathbf{y})$. This
definition corresponds to our target rates for the discrete case
(with $\Theta$ as the family of all DMC-s) and the continuous case (with
$\Theta$ the family of all joint Gaussian zero-mean distributions
$\mathcal{N}(0, \Lambda_{XY})$).
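
The correspondence with the continuous-case target rate can be checked numerically. The sketch below assumes that the Gaussian family is i.i.d. over time (zero-mean, with a free 2x2 covariance), so that the maximum likelihood fit of $X$ is $\mathcal{N}(0, v)$ and that of $X$ given $Y$ is $\mathcal{N}(\alpha Y, v')$; the function names are hypothetical. Under these assumptions the relative mutual information reduces to $-\frac{1}{2}\log(1-\rho^2)$ with $\rho$ the empirical correlation.

```python
import math
import random

def neg_log_lik_gauss(residuals, var):
    """-(1/n) log-likelihood (nats) of the residuals under i.i.d. N(0, var)."""
    n = len(residuals)
    return 0.5 * math.log(2 * math.pi * var) + sum(r * r for r in residuals) / (2 * var * n)

def relative_mi_gaussian(x, y):
    """Return (H_Theta(x) - H_Theta(x|y), -0.5*log(1 - rho^2)) for the assumed family."""
    n = len(x)
    sxx = sum(a * a for a in x) / n
    syy = sum(b * b for b in y) / n
    sxy = sum(a * b for a, b in zip(x, y)) / n
    # H_Theta(x): the ML variance of a zero-mean Gaussian is the empirical second moment.
    h_x = neg_log_lik_gauss(x, sxx)
    # H_Theta(x|y): ML gain alpha and ML residual variance of X | Y ~ N(alpha*Y, v).
    alpha = sxy / syy
    resid = [a - alpha * b for a, b in zip(x, y)]
    v = sum(r * r for r in resid) / n
    h_x_given_y = neg_log_lik_gauss(resid, v)
    rho_sq = sxy * sxy / (sxx * syy)
    return h_x - h_x_given_y, -0.5 * math.log(1 - rho_sq)

rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(1000)]
y = [0.8 * a + rng.gauss(0, 0.5) for a in x]   # arbitrary test data
lhs, rhs = relative_mi_gaussian(x, y)
print(lhs, rhs)
```

The two returned values agree up to floating point error for any pair of real sequences with non-degenerate energy, which is the identity the conjecture relies on in the Gaussian case.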

B. Overhead and error exponent 

Another aspect is the overhead associated with extending the 
empirical distribution ("channel") family which is considered 
(both in considering time dependence and in increasing the 
accuracy with which the distribution is estimated or described). 
This overhead is related to the redundancy or regret associated 
with universal distributions (see [25]). Although we have not
performed a detailed analysis of the overheads and considered
only the asymptotically achievable rates, it is evident from
comparing Lemmas 1 and 4 that the tighter rates we obtained
for the discrete channel come at the cost of additional overhead
($O(\log(n))$ compared to $O(1)$ in the continuous case), which is
associated with the richness of the channel family (describing
a conditional probability as opposed to a single correlation
factor). Thus, for example, for a discrete channel with a large
alphabet and a small block size n we would sometimes be 
better off using the "continuous channel model" version of 
our scheme (gaining only from the correlation) rather than 
the scheme of the discrete case (gaining the empirical mutual 
information). The issue of overheads requires additional analysis
in order to determine the bounds on the overheads and
the tradeoff between richness of the channel family and the
rate, for a finite $n$. As we noted in Section VI-C2, the bounds
we currently have for the rate-adaptive continuous case are
especially loose and call for improvement.
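
To give a feeling for the orders of magnitude, the following sketch reads the rate penalty directly off the pairwise bounds of the two lemmas: $|\mathcal{X}||\mathcal{Y}|\log(n+1)/n$ from the exponent of Eq. (82), and $(\log 2 + R_2(t))/n$ from Eq. (93), where $R_2(t) = -\frac{1}{2}\log(1-t^2)$. It is only an illustration of the $O(\log n)$ versus $O(1)$ behavior under these assumptions, not the detailed overhead analysis called for above; the function names are hypothetical.

```python
import math

def discrete_overhead(n, size_x, size_y):
    """Rate penalty |X||Y| log(n+1)/n (nats), from the exponent in Eq. (82)."""
    return size_x * size_y * math.log(n + 1) / n

def continuous_overhead(n, t):
    """Rate penalty (log 2 + R_2(t))/n (nats), from Eq. (93):
    2*exp(-(n-1)*R_2(t)) = exp(-n*(R_2(t) - (log 2 + R_2(t))/n))."""
    r2 = -0.5 * math.log(1 - t * t)
    return (math.log(2) + r2) / n

# Example: 16-ary input and output alphabets versus the correlation-based model.
for n in (100, 1000, 10000, 100000):
    print(n, discrete_overhead(n, 16, 16), continuous_overhead(n, 0.9))
```

For a 16-ary alphabet and $n = 100$ the discrete penalty exceeds the largest possible empirical mutual information ($\log 16$ nats), which is the regime in which the correlation-based version may be preferable.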

Since rate can be traded off for error probability, a related 
question is the error exponent. Here, a good definition is still 
lacking for variable rate schemes, and the error exponents 
are not known for individual channels. The scheme we de- 
scribed does not endeavor to attain a good error exponent. 
Specifically, since the block of n channel uses is broken 
into multiple smaller blocks, it is probably not an efficient 
scheme in terms of error rate. We note, however, that for 
rate adaptive schemes with feedback a good error exponent 
does not necessarily relate to the capability of sending a 
message with small probability of error, but rather to the 
capability to detect the errors. A similar situation occurs in 
the setting of random decision time considered by Burnashev 
021 • In the latter, an uncertainty of the decoder with respect
to the message is mitigated by sending an acknowledge / 
unacknowledge (ACK/NACK) message and possibly repeating 
the transmission with small penalty in the average rate (see a 
good description in ifTTIl sec IV.B). A similar approach can 
be used in our setting (fixed decoding time, variable rate), by 
sending an ACK/NACK over a fixed portion of the block and 
setting $R = 0$ when the decoder is not certain of the received
message. However, we did not perform a detailed analysis.
Note also that the analysis of the probability $P_A$ to transmit at
a rate lower than the target rate function is entangled with the 
error analysis, since by such schemes it is possible to trade off 
rate for error, and reduce the error probability at the expense 
of increasing the probability to fall short of the target rate. 

C. Determining the behavior of the transmitted signal (prior) 

In this work we assumed a fixed prior (input probability
distribution) and have not dealt with the question of determining
the prior, or more generally, how the encoder should adapt
its behavior based on the feedback. Had the channel been
a compound one, it stands to reason that a scheme using
feedback could estimate the channel, adjust the input prior,
and asymptotically attain the channel capacity. However,
in the scope of individual channels (as well as individual
sequence channels and AVC-s) it is not clear whether the
approach of adjusting the input distribution to the measured
conditional distribution is of merit, or whether the empirical channel
capacity can be attained for every sequence; even the
definition of achievability is unclear if the input distribution
is allowed to vary.

Another related aspect is what we require from a communication
system when considered under the individual channel
framework. This question is relevant to all the requirements
defined in the theorems (for example, is the existence of
the failure set $J$ necessary?); however, the most outstanding
requirement is related to the prior.

Currently we constrained the input sequence to be a random
i.i.d. sequence chosen from a fixed prior, which seems to be
an overly narrow definition. The rationale behind this choice
is that without any constraint on the input, the theorems
we presented can be attained in a void way by transmitting
only bad (e.g. fixed) sequences that guarantee zero empirical
rate. Furthermore, without this constraint, attainability results
for probabilistic models, and in general any attainable rates
which are not conditioned on the input sequence, could not
be derived from our individual sequence theorems. A weaker
requirement from the encoder is to be able to emit any
possible sequence; however, this requirement is not sufficient,
since from the existence of such encoders we could not infer
the existence of encoders achieving any positive rate over a
specific channel. Consider for example the encoder satisfying
the requirement by transmitting bad sequences with probability
$1 - \epsilon$ and good sequences with probability $\epsilon \to 0$. Theorems
1, 2, 3 and 4 are existence theorems, i.e. they guarantee the
existence of at least one system satisfying the conditions.
Had we removed the requirement for a fixed input prior, these
theorems would, as we saw, be attained by encoders that are
unsatisfactory in other aspects. Once the theorem is satisfied
by one encoder it cannot guarantee the existence of other
(satisfactory) encoders, which would make it useless. Therefore
the requirement for a fixed prior is necessary in the current
framework. Although in the scope of the theorems presented
here this requirement only strengthens the theorems (since
it reveals additional properties of the encoder attaining the
other conditions of the theorem), we are still bothered by
the question of what the minimal requirements from a
communication system should be, and these hopefully will not include
a constraint on the input distribution.

This issue relates to a fundamental difficulty which arises
in communication over individual channels: unlike universal
source coding, in which the sequence is given a-priori, here
the sequences are given a-posteriori, and the actions of the
encoder affect the outcome in an unspecified way. Currently
we broke the tie by placing a constraint on the encoder, but
we seek a more general definition of the problem.



D. Amount of randomization

We have assumed so far that there is no restriction on the amount
of common randomness available and have not attempted
to minimize the amount of randomization required (while
maintaining the same rates). It is shown in [2] that less than
$O(n)$ of randomization information is required in some cases
and $O(n)$ is enough for others (see section V.5 therein),
whereas we have used at least $O(M \cdot n) > O(n^2)$ random
drawings to produce the codebook.

E. Practical aspects

The scheme described in this work is a theoretical one,
but the concept appears to be extendable to practical coding
systems. Below we focus on the continuous case and merely
give the motivation (without proof). One may replace the
correlation receiver (GLRT) by a receiver utilizing training
symbols to learn the channel effective gain, and then apply
maximum likelihood (or approximate, e.g. iterative) decoding.
The randomization of the codebook may be replaced by using
a fixed code with random interleaving, since with random
interleaving only the empirical distribution of the (effective)
noise sequence affects the error probability, and we may
conjecture that the property that the Gaussian noise distribution
is the worst is approximately true for practical codes (such
as turbo codes and LDPC). When using a random interleaver,
the training symbols as well as part of the coded symbols
can be interleaved together, and the decoding attempts (which
occur every symbol in the theoretical scheme) occur only at the
end of each interleaving block. The rateless code is replaced
by an incremental redundancy scheme, i.e. by sending each
time part of the symbols of the codeword, and repeating the
codeword if all symbols were transmitted without successful
decoding. The decision when to decode can be simply replaced
by decoding and using a CRC check. Finally the common
randomness (required only for the generation of the interleaver
permutation) can be replaced by pseudo-randomness. Such a
scheme may not be able to attain the promise of Theorem
4 for every individual sequence but may be able to adapt to
every natural and man-made channel.

F. Random decision time

In our discussion we have described two communication
scenarios: fixed rate without feedback and variable rate with
feedback, and in both we assumed a fixed block size $n$.
Another scenario is that of random decision time or rateless
coding (as in [12], [19]) in which the block size is not fixed but
determined by the decoder. We did not include this scenario
since the achievability result is, in a way, less elegant: the
decoder indirectly affects the target rate (mutual information)
through the block size. On the other hand this case may be
of practical interest. Clearly the mutual information can be
asymptotically attained for this communication scenario as
well, and its analysis is merely a simpler version of the rate
analysis performed in Section VI-C since convexity is not
required.


G. Bounds 

In this paper we focused on achievable rates and did not
show a converse. An almost obvious statement is that any
continuous rate function which depends only on the zero-order
empirical statistics / correlation (respectively) cannot exceed
asymptotically the rate functions of Theorems 3 and 4, respectively,
with vanishing error probability. To show the statement for
the discrete case, let $\mathbf{y}$ be generated by a memoryless channel
$W(y|x)$. Then by the law of large numbers the empirical
distribution converges to the channel distribution, and from the
continuity of the rate function the empirical rate converges to
the rate function taken at the channel distribution. Since by
Theorem 3 the actual rate asymptotically meets or exceeds
the rate function, and by the converse of the channel capacity
theorem the actual rate cannot exceed (asymptotically) the
mutual information, we have that the rate function cannot
exceed the mutual information ($R_{emp} \le R_{act} \le I(P, W)$),
up to asymptotically vanishing factors. For the continuous
case the analogous claim is shown by taking a Gaussian
additive channel and replacing "distribution" by "correlation"
and "empirical mutual information" by $-\frac{1}{2}\log(1 - \rho^2)$. The
same applies also to rate functions obeying the conditions of
Theorems 1 and 2. More general bounds are yet to be studied.
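
The law-of-large-numbers step of this argument can be illustrated numerically. The sketch below is only a sanity check under assumed parameters (a binary symmetric channel with crossover probability 0.1 and a uniform i.i.d. input, neither of which is prescribed by the text): the empirical mutual information of a long input/output pair is close to $I(P, W)$.

```python
import math
import random
from collections import Counter

def empirical_mi(x, y):
    """Zero-order empirical mutual information (nats) of two equal-length sequences."""
    n = len(x)
    joint = Counter(zip(x, y))
    px = Counter(x)
    py = Counter(y)
    return sum((c / n) * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())

def bsc_mutual_information(eps):
    """I(P, W) in nats for a uniform input P over a BSC with crossover probability eps."""
    return math.log(2) + eps * math.log(eps) + (1 - eps) * math.log(1 - eps)

rng = random.Random(1)
eps, n = 0.1, 200000                       # assumed channel and block length
x = [rng.randint(0, 1) for _ in range(n)]
y = [b ^ (rng.random() < eps) for b in x]  # flip each input bit with probability eps

print(empirical_mi(x, y), bsc_mutual_information(eps))
```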

H. Comparison of the rate adaptive scheme with the similar
scheme in [3]

As noted, the rate adaptive scheme we use is similar to the
scheme of [3] in its high level structure. Table II compares
some attributes of the schemes.

Another important factor is the overhead (i.e. the loss in
number of bits communicated with a given error exponent,
compared to the target rate), which we were unable to
compare. We conjecture that the current scheme may have
a lower overhead due to its simplicity, which results in a
smaller number of parameters and constraints on their order
of magnitude (compared to the scheme of [3], where relations
between factors such as the number of pilots and the minimum
size of a chunk may require a large value of $n$).

TABLE II
Comparison of rate adaptive schemes in the current paper and [3]

| Item | Eswaran et al [3] | Current paper | Comments |
|---|---|---|---|
| Channel model | Individual sequence | Individual channel | |
| Mechanism for adaptivity | Repeated instances of rateless coding | Repeated instances of rateless coding | |
| Transmit format | Total time divided to rounds (rateless blocks) which are divided to chunks | Total time divided to rateless blocks | Chunks in [3] used as feedback instances and expurgated code has constant type over chunks |
| Feedback | Ternary (Bad Noise/Decoded/Keep Going), once per chunk | Binary (Decoded/Not Decoded) per symbol | Easy to generalize to once every $1/\epsilon$ symbols (see Section VI-C) |
| Alphabet | Discrete | Discrete or real valued | |
| Training | Known symbols in random locations in each chunk | None | |
| Randomness | Full ($O(\exp(nR))$) | Full ($O(\exp(nR))$) | Might be reduced by selection from a smaller collection of codebooks (in both cases) |
| Codebook construction | Constant composition + expurgation + training insertion | Random i.i.d. | |
| Stopping condition | Threshold over mutual information of channel estimated from training | Threshold over empirical mutual information of best codeword | |
| Decoding | Maximum (empirical) mutual information | Maximum (empirical) mutual information | |
| Stopping location | End of chunk | Any symbol | |

IX. Conclusion

We examined achievable transmission rates for channels
with unspecified models, and focused on rates determined
by a channel's a-posteriori empirical behavior, and specifically
on rate functions which are determined by the zero-order
empirical distribution. This communication approach
does not require a-priori specification of the channel model.
The main result is that for discrete channels the empirical
mutual information between the input and output sequences is
attainable for any output sequence using feedback and common
randomness, and for continuous real valued channels an
effective "Gaussian capacity" $-\frac{1}{2}\log(1 - \rho^2)$ can be attained. This
generalizes results obtained for individual noise sequences and
is a useful model for analyzing compound, arbitrarily varying,
and individual noise sequence channels.
Appendix

A. Proof of Lemma 1

The proof is a rather standard calculation using the method
of types. We use the notations of [10]. We divide the sequences
according to their joint type $T_{XY}$. The type $T_{XY}$ is
defined by the probability distribution $T_{XY} \in \mathcal{P}_n(\mathcal{X}\mathcal{Y})$. For
notational purposes we define the dummy random variables
$(X,Y) \sim T_{XY}$, and $T_X$, $T_Y$, $T_{X|Y}$ as the marginal and conditional
distributions resulting from $T_{XY}$. Following [10], the
conditional type is defined as $T_{X|Y}(\mathbf{y}) = \{\mathbf{x} : (\mathbf{x}, \mathbf{y}) \in T_{XY}\}$.
The empirical mutual information of sequences in the type
$T_{XY}$ is simply $I(X;Y) = I(T_Y, T_{X|Y})$. Define $\mathcal{T}_t = \{T_{XY} \in \mathcal{P}_n(\mathcal{X}\mathcal{Y}) : I(T_Y, T_{X|Y}) \ge t\}$. Since all sequences in the
conditional type have the same (marginal) type, we can write:

$$\begin{aligned}
Q^n\left(\hat{I}(\mathbf{x};\mathbf{y}) \ge t\right) &= \sum_{T_{XY} \in \mathcal{T}_t} Q^n\left(T_{X|Y}(\mathbf{y})\right) \\
&\overset{(a)}{=} \sum_{T_{XY} \in \mathcal{T}_t} |T_{X|Y}(\mathbf{y})| \exp\{-n[H(T_X) + D(T_X\|Q)]\} \\
&\overset{(b)}{\le} \sum_{T_{XY} \in \mathcal{T}_t} \exp(nH(X|Y)) \exp\{-n[H(X) + D(T_X\|Q)]\} \\
&= \sum_{T_{XY} \in \mathcal{T}_t} \exp\{-n[I(X;Y) + D(T_X\|Q)]\} \\
&\le |\mathcal{P}_n(\mathcal{X}\mathcal{Y})| \cdot \exp\left\{-n \min_{T_{XY} \in \mathcal{T}_t}\left[I(T_Y, T_{X|Y}) + D(T_X\|Q)\right]\right\} \\
&\overset{(c)}{\le} (n+1)^{|\mathcal{X}||\mathcal{Y}|} \cdot \exp(-nt) \\
&= \exp\left\{-n\left(t - |\mathcal{X}||\mathcal{Y}| \frac{\log(n+1)}{n}\right)\right\}
\end{aligned} \tag{82}$$

where (a) is due to [10] Eq. (II.1), (b) results from eq. (83)
below, which is an extension of (II.4) there to conditional types
(and is a stronger version of Lemma II.3), based on the fact
that in the conditional type $T_{X|Y}(\mathbf{y})$ the values of $\mathbf{x}$ over
the $n_a = nT_Y(a)$ indices for which $y_i = a$ have empirical
distribution $T_{X|Y}$ and therefore the number of such sequences
is limited to $\exp(n_a H(X|Y = a))$, hence:

$$|T_{X|Y}(\mathbf{y})| \le \prod_a \exp\left(nT_Y(a)H(X|Y = a)\right) = \exp(nH(X|Y)) \tag{83}$$

(c) is based on bounding the number of types (see [14],
Theorem 11.1.1), and the fact that in the minimization region
$I(T_Y, T_{X|Y}) \ge t$ and $D(T_X\|Q) \ge 0$, therefore the result of
the minimum is at least $t$.
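
For binary alphabets the probability bounded in this proof can be evaluated exactly, which gives a simple sanity check of the bound $\exp\{-n(t - |\mathcal{X}||\mathcal{Y}|\log(n+1)/n)\}$. In the sketch below (function names are hypothetical) a fixed $\mathbf{y}$ with $n_0$ zeros and $n_1$ ones is assumed, and the joint type of $(\mathbf{x}, \mathbf{y})$ is indexed by the number of ones of $\mathbf{x}$ within each of the two groups of positions, so the sum over types in the proof becomes a double loop.

```python
import math

def binary_mi_from_counts(n0, n1, k0, k1):
    """Empirical MI (nats) when y has n0 zeros and n1 ones, and x has k0 ones among
    the positions where y = 0 and k1 ones among the positions where y = 1."""
    n = n0 + n1
    counts = {(0, 0): n0 - k0, (1, 0): k0, (0, 1): n1 - k1, (1, 1): k1}  # keys are (x, y)
    px = {0: (n0 - k0) + (n1 - k1), 1: k0 + k1}
    py = {0: n0, 1: n1}
    mi = 0.0
    for (a, b), c in counts.items():
        if c > 0:
            mi += (c / n) * math.log(c * n / (px[a] * py[b]))
    return mi

def prob_mi_at_least(t, n0, n1, q):
    """Exact Q^n( I_hat(x;y) >= t ) for x i.i.d. Bernoulli(q) and a fixed y."""
    n = n0 + n1
    total = 0.0
    for k0 in range(n0 + 1):
        for k1 in range(n1 + 1):
            if binary_mi_from_counts(n0, n1, k0, k1) >= t:
                ones = k0 + k1
                # size of the conditional type times the probability of each member
                total += math.comb(n0, k0) * math.comb(n1, k1) * q**ones * (1 - q)**(n - ones)
    return total

def lemma1_bound(t, n, size_x=2, size_y=2):
    """Right-hand side of Eq. (82)."""
    return math.exp(-n * (t - size_x * size_y * math.log(n + 1) / n))

exact = prob_mi_at_least(0.3, 100, 100, 0.5)
bound = lemma1_bound(0.3, 200)
print(exact, bound)
```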
B. Discussion of Lemma 1

1) An alternative proof for the exponential rate: For the
proof of Theorem 1 we do not need the strict inequalities, and
equality in the error exponent would be sufficient; however,
these will be useful later for the rateless coding. An explanation
for the fact that the result does not depend on $Q$ can be
obtained by showing that the above probability can be bounded
for each type of $\mathbf{x}$ separately. I.e. if $\mathbf{x}$ is drawn uniformly over
the type $T_x$, the probability of the above condition is:

$$\sum_{T_{XY} \in \mathcal{T}_t} \frac{|T_{X|Y}(\mathbf{y})|}{|T_X|} \doteq \sum_{T_{XY} \in \mathcal{T}_t} \frac{\exp(nH(X|Y))}{\exp(nH(X))} = \sum_{T_{XY} \in \mathcal{T}_t} \exp(-nI(X;Y)) \doteq \exp(-nt) \tag{84}$$

where $\mathcal{T}_t = \{T_{XY} \in \mathcal{P}_n(\mathcal{X}\mathcal{Y}) : (T_{XY})_X = T_x, (T_{XY})_Y = T_y, I(T_Y, T_{X|Y}) \ge t\}$, and since drawing $\mathbf{x} \sim Q^n$ is
equivalent to first drawing the type of $\mathbf{x}$ and then drawing
$\mathbf{x}$ uniformly over the type, the bound holds when $\mathbf{x} \sim Q^n$.

2) Extension to alpha receivers: In the following we discuss
an extension of the bound and relate it to Agarwal's
coding theorem using the rate distortion function. Consider
a communication system similar to that of Theorem 1 where
the codebook is a constant composition code, consisting of
randomly selected sequences of type $Q$, and the receiver is
an $\alpha$ receiver (see [26]), i.e. it selects the received codeword
by maximizing a function $\hat{\alpha}(\mathbf{x}, \mathbf{y})$ depending only on the
joint empirical distribution of the sequences $\mathbf{x}, \mathbf{y}$. The function
$\alpha(X,Y) = \alpha(T_{XY})$ is defined as the respective function of
the distribution of $X,Y$. Then, the pairwise error probability
may be bounded similarly to eq. (84) by replacing the
condition $I(T_Y, T_{X|Y}) \ge t$ in the definition of $\mathcal{T}_t$ by
$\alpha(T_{XY}) \ge t$, and obtaining:

$$\Pr\left(\hat{\alpha}(\mathbf{x},\mathbf{y}) \ge t\right) \;\dot{\le}\; \exp\left\{-n \min_{\substack{X \sim Q,\; Y \sim \hat{P}_{\mathbf{y}} \\ \alpha(X,Y) \ge t}} I(X;Y)\right\} \le \exp\left\{-n \min_{\substack{X \sim Q \\ \alpha(X,Y) \ge t}} I(X;Y)\right\} \tag{85}$$

Following the proof of Theorem 1, the RHS of eq. (85)
determines the following achievable rate:

$$R_{emp}(\mathbf{x},\mathbf{y}) = \min_{\substack{X \sim Q \\ \alpha(X,Y) \ge \hat{\alpha}(\mathbf{x},\mathbf{y})}} I(X;Y) \lesssim \hat{I}(\mathbf{x};\mathbf{y}) \tag{86}$$

where the approximate inequality stems from substituting
the empirical distribution of $\mathbf{x},\mathbf{y}$ as a particular distribution
of $X,Y$ meeting the minimization constraints. The above
expression is similar to the one obtained in mismatch decoding
with random codes. Eq. (85) allows a larger (but still limited)
scope of empirical rate functions, but also shows that within
this scope the best function is still the empirical mutual information.
On the other hand, an advantage of this expression is
that under some continuity conditions it can be extended from
discrete to continuous vectors (as performed in [8]).

When substituting $\alpha$ with the distortion function $\alpha(X, Y) = -E\,d(X, Y)$, we would obtain:

$$R_{emp}(\mathbf{x},\mathbf{y}) = \min_{\substack{X \sim Q \\ E\,d(X,Y) \le E\,d(\mathbf{x},\mathbf{y})}} I(X;Y) = R_X\left(E\,d(\mathbf{x},\mathbf{y})\right) = R_X(D) \tag{87}$$

where $R_X(D)$ is the rate distortion function of an i.i.d. source
$X \sim Q$ with the distortion metric $d$. The latter relation can
be used to show the result that communication at the rate
$R_X(D)$ is possible, where $D$ is the empirical or the maximum
guaranteed distortion of the channel, as shown in [8]. On the
other hand, when using the correlation function $\alpha(X, Y) = \frac{E(XY)^2}{E(X^2)E(Y^2)} = \rho^2$, we would obtain from eq. (86) and Lemma
2 $R_{emp}(\mathbf{x}, \mathbf{y}) \approx -\frac{1}{2}\log(1 - \rho^2)$. Note that although the latter
expression is the same as the one obtained in Theorem 2, the
above derivation only proves it for discrete vectors.
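
As a concrete instance of the rate-distortion form of the rate, consider a binary input with Hamming distortion. This choice, the uniform prior, and the function names below are assumptions made for the example; the closed form $R_X(D) = H_b(q) - H_b(D)$ for $D < \min(q, 1-q)$ is the standard rate distortion function of a Bernoulli($q$) source. With a uniform prior the resulting rate is $\log 2 - H_b(D)$, i.e. the capacity of a binary symmetric channel whose crossover probability equals the empirical fraction of flipped symbols.

```python
import math

def binary_entropy(p):
    """Binary entropy H_b(p) in nats."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def rate_distortion_binary_hamming(q, d):
    """R_X(D) in nats for a Bernoulli(q) source under Hamming distortion."""
    if d >= min(q, 1 - q):
        return 0.0
    return binary_entropy(q) - binary_entropy(d)

def empirical_hamming_distortion(x, y):
    """Fraction of positions in which the two sequences differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

# Toy input/output pair: 100 symbols, the first 10 of which are flipped.
x = [0, 1] * 50
y = [1 - a if i < 10 else a for i, a in enumerate(x)]

d_emp = empirical_hamming_distortion(x, y)
rate = rate_distortion_binary_hamming(0.5, d_emp)
print(d_emp, rate)
```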

C. Proof of Lemma 2 

For random variables $X$ and $Y$ where $X$ is continuous (not
necessarily Gaussian) we have the following bound on the
conditional differential entropy ($\tilde{Y}$ denotes a dummy variable
with the same distribution as $Y$ and is used for notational
purposes):

$$\begin{aligned}
h(X|Y) &= E_{\tilde{Y}}\left[h(X|Y = \tilde{Y})\right] \overset{(a)}{\le} E\left[\frac{1}{2}\log\left(2\pi e\,\mathrm{VAR}(X|Y)\right)\right] \\
&\overset{(b)}{\le} \frac{1}{2}\log\left(2\pi e\,E[\mathrm{VAR}(X|Y)]\right) = \frac{1}{2}\log\left(2\pi e\,E[\mathrm{VAR}(X - \alpha \cdot Y|Y)]\right) \\
&\overset{(c)}{\le} \frac{1}{2}\log\left(2\pi e\,E(X - \alpha \cdot Y)^2\right) \overset{\alpha = \frac{E(XY)}{E(Y^2)}}{=} \frac{1}{2}\log\left(2\pi e\,E(X^2)(1 - \rho^2)\right) \\
&= \frac{1}{2}\log\left(2\pi e\,E(X^2)\right) + \frac{1}{2}\log(1 - \rho^2)
\end{aligned} \tag{88}$$

where (a) is based on the Gaussian bound for entropy, (b)
on concavity of the log function (see also [14] Eq. (17.24)), and
(c) is based on $\mathrm{VAR}(X) = E(X^2) - (EX)^2 \le E(X^2)$
and is similar to the assertion that $E[\mathrm{VAR}(X|Y)]$, which is
the MMSE estimation error, is not worse than the LMMSE
estimation error (except for our disregard of the mean).
Therefore for a Gaussian $X$:

$$I(X;Y) = h(X) - h(X|Y) = \frac{1}{2}\log\left(2\pi e\,E(X^2)\right) - h(X|Y) \ge -\frac{1}{2}\log(1 - \rho^2) \tag{89}$$

□
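
Two closed-form examples may help to see the bound $I(X;Y) \ge -\frac{1}{2}\log(1-\rho^2)$ at work; both are illustrations chosen here and are not taken from the paper. For $Y = X + N$ with independent zero-mean Gaussian $X$ and $N$ the bound is met with equality, while for $Y = \mathrm{sign}(X)$ with $X \sim \mathcal{N}(0,1)$ we have $I(X;Y) = \log 2$ and $\rho = E|X| = \sqrt{2/\pi}$, so the bound is strict.

```python
import math

def gaussian_rate(rho_sq):
    """The lower bound -0.5*log(1 - rho^2), in nats."""
    return -0.5 * math.log(1 - rho_sq)

# Example 1: additive Gaussian noise, Y = X + N with E(X^2) = P, E(N^2) = noise_var.
P, noise_var = 4.0, 1.0
rho_sq_awgn = P / (P + noise_var)           # E(XY)^2 / (E(X^2) E(Y^2))
mi_awgn = 0.5 * math.log(1 + P / noise_var)

# Example 2: hard limiter, Y = sign(X) with X ~ N(0, 1).
rho_sq_sign = 2 / math.pi                   # (E|X|)^2, since E(X^2) = E(Y^2) = 1
mi_sign = math.log(2)                       # Y is a fair bit determined by X

print(mi_awgn, gaussian_rate(rho_sq_awgn))  # equal up to rounding
print(mi_sign, gaussian_rate(rho_sq_sign))  # strict inequality
```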



Proof of corollary 2.1: Equality (a) holds only if $X|Y$ is
Gaussian for every value of $Y$, (b) holds if $X$ has fixed
variance conditioned on every $Y$, and (c) if $E(X - \alpha \cdot Y|Y) = 0 \Rightarrow E(X|Y) = \alpha \cdot Y$; therefore it results in $X|Y \sim \mathcal{N}(\alpha Y, \mathrm{const})$, which implies $X, Y$ are jointly Gaussian (easy
to check by calculating the pdf).

Note that if $X, Y$ are jointly Gaussian then $Y$ can be
represented as a result of an additive white Gaussian noise
channel (AWGN) with gain operating on $X$:

$$Y \sim E(Y|X) + \mathcal{N}(0, \mathrm{VAR}(Y|X)) = \alpha \cdot X + \mathcal{N}(0, \sigma^2) + \mathrm{const} \tag{90}$$

To show corollary 2.2 consider $X = Y = \mathrm{Ber}(\frac{1}{2})$, in which
case $I(X;Y) = 1$ and $\rho = 1$, therefore the assertion doesn't
hold.


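Fig. 6. A geometric interpretation of Lemma 3 (figure labels: $\cos(\alpha) = t$; surface of the sphere).

D. Proof of Lemma 4

Write the empirical correlation as

$$\rho = \frac{\mathbf{x}^T\mathbf{y}}{\|\mathbf{x}\| \cdot \|\mathbf{y}\|} = \left(\frac{\mathbf{x}}{\|\mathbf{x}\|}\right)^T\left(\frac{\mathbf{y}}{\|\mathbf{y}\|}\right) \tag{91}$$

From the expression above we can infer that $\rho$ does not depend
on the amplitude of $\mathbf{x}$ and $\mathbf{y}$ but only on their direction.
Since $\mathbf{x}$ is isotropically distributed, the result does not depend
on the direction of $\mathbf{y}$ (unless $\mathbf{y} = 0$, in which case it is
trivially correct), therefore it is independent of $\mathbf{y}$ and we can
conveniently choose $\mathbf{y} = (1, 0, 0, \ldots, 0)$. To put the claim
above more formally, for any unitary $n \times n$ matrix $U$ we can
write:

$$\rho = \frac{\mathbf{x}^T\mathbf{y}}{\sqrt{(\mathbf{x}^T\mathbf{x})(\mathbf{y}^T\mathbf{y})}} = \frac{\mathbf{x}^TU^TU\mathbf{y}}{\sqrt{(\mathbf{x}^TU^TU\mathbf{x})(\mathbf{y}^TU^TU\mathbf{y})}} = \left(\frac{U\mathbf{x}}{\|U\mathbf{x}\|}\right)^T\left(\frac{U\mathbf{y}}{\|U\mathbf{y}\|}\right) \tag{92}$$

Since $\mathbf{x}$ is Gaussian, $U\mathbf{x}$ has the same distribution as $\mathbf{x}$,
thus the probability remains unchanged if we remove $U$ from
the left side and remain with $\rho' = \left(\frac{\mathbf{x}}{\|\mathbf{x}\|}\right)^T\left(\frac{U\mathbf{y}}{\|U\mathbf{y}\|}\right)$. For
$\mathbf{y} \ne 0$, we may choose the unitary matrix $U$ whose first
row is $\mathbf{y}^T/\|\mathbf{y}\|$ and whose other rows complete it to an orthonormal
basis of the linear space $\mathbb{R}^n$. Then $U\mathbf{y} = (\|\mathbf{y}\|, 0, 0, \ldots, 0)$ and
therefore $\frac{U\mathbf{y}}{\|U\mathbf{y}\|} = (1, 0, 0, \ldots, 0)$. Thus the distribution of
$\rho' = (1, 0, 0, \ldots, 0) \cdot \left(\frac{\mathbf{x}}{\|\mathbf{x}\|}\right) = \frac{x_1}{\|\mathbf{x}\|}$ equals the distribution
of $\rho$. Assuming without loss of generality that $\mathbf{x} \sim \mathcal{N}^n(0, 1)$
we have:

$$\begin{aligned}
\Pr(|\rho| \ge t) &= \Pr\left(\frac{x_1^2}{\|\mathbf{x}\|^2} \ge t^2\right) = \Pr\left(x_1^2 \ge t^2\left(\|\mathbf{x}_2^n\|^2 + x_1^2\right)\right) = \Pr\left(|x_1| \ge \frac{t}{\sqrt{1-t^2}}\|\mathbf{x}_2^n\|\right) \\
&= E\left[2Q\left(\frac{t\,\|\mathbf{x}_2^n\|}{\sqrt{1-t^2}}\right)\right] \le 2\int \frac{1}{(2\pi)^{(n-1)/2}}\, e^{-\frac{1}{2}\|\mathbf{x}_2^n\|^2}\, e^{-\frac{t^2}{2(1-t^2)}\|\mathbf{x}_2^n\|^2}\, d\mathbf{x}_2^n \\
&= 2\int \frac{1}{(2\pi)^{(n-1)/2}}\, e^{-\frac{\|\mathbf{x}_2^n\|^2}{2(1-t^2)}}\, d\mathbf{x}_2^n = 2(1-t^2)^{\frac{n-1}{2}} \int f_{\mathcal{N}^{n-1}(0,\,1-t^2)}(\mathbf{x}_2^n)\, d\mathbf{x}_2^n \\
&= 2(1-t^2)^{\frac{n-1}{2}} = 2\exp\left(-(n-1)R_2(t)\right)
\end{aligned} \tag{93}$$

where we used the rough upper bound of the Gaussian error
function $Q(x) = \Pr(\mathcal{N}(0,1) \ge x) \le e^{-x^2/2}$, and $f_{\mathcal{N}^n(\mu,\sigma^2)}$
denotes the pdf of a Gaussian i.i.d. vector. □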
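
The bound $\Pr(|\rho| \ge t) \le 2(1-t^2)^{(n-1)/2}$ can be compared with a Monte Carlo estimate. The following sketch uses an arbitrary fixed $\mathbf{y}$, an i.i.d. standard Gaussian $\mathbf{x}$, and illustrative values of $n$, $t$ and the number of trials; all of these are assumptions made for the example.

```python
import math
import random

def empirical_correlation(x, y):
    """rho = x'y / (|x| |y|), without removal of the mean."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))

def lemma4_bound(t, n):
    """Right-hand side of Eq. (93): 2 * (1 - t^2)^((n-1)/2)."""
    return 2 * (1 - t * t) ** ((n - 1) / 2)

rng = random.Random(0)
n, t, trials = 10, 0.5, 20000
y = [3.0, -1.0, 0.5, 2.0, 0.0, -4.0, 1.5, 0.25, -0.75, 1.0]   # arbitrary fixed output

hits = 0
for _ in range(trials):
    x = [rng.gauss(0, 1) for _ in range(n)]
    hits += abs(empirical_correlation(x, y)) >= t
estimate = hits / trials

print(estimate, lemma4_bound(t, n))
```

As expected from the rough bound on the $Q$ function used in the proof, the estimate falls well below the bound for these parameters.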
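Discussion: A geometrical interpretation of the lemma relates
this probability to the solid angle of the cone $\{\mathbf{x} : |\rho| \ge t\}$.
Since $\mathbf{x}$ is isotropically distributed, the probability to have
$|\rho| \ge t$ equals the relative surface determined by vectors
having $|\rho| \ge t$ on the unit $n$-ball (termed the solid angle).
Since $\rho$ is the cosine of the angle between $\mathbf{x}$ and $\mathbf{y}$, the
points where $|\rho| \ge t$ generate a cone with inner angle $2\alpha$
where $\cos(\alpha) = t$, and their intersection with the unit $n$-ball
is a spherical cap (dome), shown in Figure 6. We can obtain
a similar bound as above using geometrical considerations.
Write the volume of an $n$ dimensional ball as $V_n r^n$ where
$V_n$ is a fixed factor $V_n = \frac{\pi^{n/2}}{\Gamma(1 + n/2)}$, and accordingly the
surface of an $n$ dimensional ball is (the derivative) $nV_n r^{n-1}$;
then the relative surface of the spherical cap can be computed
by integrating the surfaces of the $n-1$ dimensional balls with
radius $\sin(\theta)$ that have a fixed angle $\theta$ with respect to $\mathbf{y}$, and
can be bounded as follows:

$$\begin{aligned}
\Pr(|\rho| \ge t) &= 2 \cdot \frac{\text{Surface of cap}}{\text{Surface of sphere}} = \frac{2}{nV_n}\int_{\theta=0}^{\alpha} (n-1)V_{n-1}\sin^{n-2}(\theta)\,d\theta \\
&\le \frac{2(n-1)V_{n-1}}{nV_n} \cdot \sin^{n-3}(\alpha)\int_{\theta=0}^{\alpha}\sin(\theta)\,d\theta = \frac{2(n-1)V_{n-1}}{nV_n} \cdot \sin^{n-3}(\alpha)\,(1 - \cos(\alpha)) \\
&\overset{\alpha \le \pi/2}{\le} 2\,\frac{V_{n-1}}{V_n} \cdot \sin^{n-3}(\alpha)\,(1 - \cos^2(\alpha)) = O(\sqrt{n}) \cdot \sin^{n-1}(\alpha) \\
&= O(\sqrt{n}) \cdot \left(1 - \cos^2(\alpha)\right)^{(n-1)/2} = O(\sqrt{n}) \cdot (1 - t^2)^{(n-1)/2}
\end{aligned} \tag{94}$$

where the asymptotic ratio $\frac{V_{n-1}}{V_n}\Big/\sqrt{\frac{n}{2\pi}} \to 1$ is based on [28]
Eq. (99). An interesting observation is that the assumption of a
Gaussian distribution is not necessary and this bound is true
for all isotropic distributions.

E. Proof of Lemma 6

We denote by $\mathbf{x}_i, \mathbf{y}_i$ the sub-vectors over $A_i$ (i.e. $\mathbf{x}_i = \mathbf{x}_{A_i}$, $\mathbf{y}_i = \mathbf{y}_{A_i}$), their length by $n_i = |A_i|$ and their relative
length by $\lambda_i = n_i/n$. We are interested to find a subset
$J_\Delta$ of $\mathbf{x}$ with bounded probability such that outside the set
$\sum_i \lambda_i\rho_i^2 \ge \rho^2 - \Delta$ for any $\mathbf{y}$. Consider the following
inequality:

$$\begin{aligned}
\|\mathbf{x}\|^2 \cdot \|\mathbf{y}\|^2 \cdot \rho^2 &= (\mathbf{x}^T\mathbf{y})^2 = \left(\sum_i \mathbf{x}_i^T\mathbf{y}_i\right)^2 = \left(\sum_i \rho_i\|\mathbf{x}_i\|\|\mathbf{y}_i\|\right)^2 \\
&\overset{(a)}{\le} \left(\sum_i \rho_i^2\|\mathbf{x}_i\|^2\right) \cdot \left(\sum_i \|\mathbf{y}_i\|^2\right) \\
&= \left[\sum_i \lambda_i\rho_i^2 + \sum_i \rho_i^2\left(\frac{\|\mathbf{x}_i\|^2}{\|\mathbf{x}\|^2} - \lambda_i\right)\right] \cdot \|\mathbf{x}\|^2 \cdot \|\mathbf{y}\|^2 \\
&\overset{(b)}{\le} \left[\sum_i \lambda_i\rho_i^2 + \sum_i \max\left(\frac{\|\mathbf{x}_i\|^2}{\|\mathbf{x}\|^2} - \lambda_i, 0\right)\right] \cdot \|\mathbf{x}\|^2 \cdot \|\mathbf{y}\|^2
\end{aligned} \tag{95}$$

where (a) is from the Cauchy-Schwarz inequality and (b) is since, with $z_i \equiv \frac{\|\mathbf{x}_i\|^2}{\|\mathbf{x}\|^2} - \lambda_i$, we have $\rho_i^2 z_i \le z_i$ for $z_i > 0$ and $\rho_i^2 z_i \le 0$ for $z_i < 0$, therefore always
$\rho_i^2 z_i \le \max(z_i, 0)$ (attained for $\rho_i = \mathrm{Ind}(z_i > 0)$). Both
inequalities are tight in the sense that for each $\mathbf{x}$ there is a
sequence $\mathbf{y}$ (equivalent to choosing $\{\|\mathbf{y}_i\|^2\}$ and $\{\rho_i\}$) that
meets them in equality. Dividing by $\|\mathbf{x}\|^2 \cdot \|\mathbf{y}\|^2$ we have that

$$\rho^2 - \sum_i \lambda_i\rho_i^2 \le \sum_i \max\left(\frac{\|\mathbf{x}_i\|^2}{\|\mathbf{x}\|^2} - \lambda_i, 0\right) \tag{96}$$

where the RHS depends only on $\mathbf{x}$ and should be bounded by
$\Delta$. Thus the minimal set $J_\Delta$ is:

$$J_\Delta \equiv \left\{\mathbf{x} : \sum_i \max\left(\frac{\|\mathbf{x}_i\|^2}{\|\mathbf{x}\|^2} - \lambda_i, 0\right) > \Delta\right\} \tag{97}$$

The set is minimal in the sense that none of its elements
can be removed while meeting the conditions of the lemma.
We would like to bound the probability of $J_\Delta$. The result of
$\sum_i \max(z_i, 0)$ is a partial sum of $z_i$, and since negative $z_i$ are
not summed, it is easy to see this is the maximal partial sum,
i.e. we can write this sum alternatively as

$$\sum_i \max(z_i, 0) = \max_{I \in V} \sum_{i \in I} z_i \tag{98}$$

where $V = 2^{\{1,\ldots,p\}} \setminus \emptyset$ denotes all non-empty sub-sets of
$\{1, \ldots, p\}$, and its size is $2^p - 1$. Therefore from the union
bound we have:

$$\Pr\{J_\Delta\} = \Pr\left\{\max_{I \in V} \sum_{i \in I}\left(\frac{\|\mathbf{x}_i\|^2}{\|\mathbf{x}\|^2} - \lambda_i\right) > \Delta\right\} \le \sum_{I \in V} \Pr\left\{\sum_{i \in I}\left(\frac{\|\mathbf{x}_i\|^2}{\|\mathbf{x}\|^2} - \lambda_i\right) > \Delta\right\} \tag{99}$$

To bound the above probability we first develop a bound on
the probability $\Pr\left(\sum_i a_i\|\mathbf{x}_i\|^2 \le 0\right)$ for some coefficients $a_i$:

Lemma 7. Let $\mathbf{x} \sim \mathcal{N}(0,P)^n$. For coefficients $\{a_i\}_{i=1}^p$ with
$\sum_i \lambda_i a_i = a > 0$ and $|a_i| \le A$ where $|a| \le \frac{1}{8}A$, we have

$$\Pr\left(\sum_i a_i\|\mathbf{x}_i\|^2 \le 0\right) \le e^{-nE} \tag{100}$$

where

$$E = \frac{a^2}{6A^2} \tag{101}$$

Now we apply the bound to the events in Eq. (99):

$$\sum_{i \in I}\left(\frac{\|\mathbf{x}_i\|^2}{\|\mathbf{x}\|^2} - \lambda_i\right) > \Delta \iff \sum_{i \in I}\|\mathbf{x}_i\|^2 - \sum_{i \in I}\lambda_i \sum_{i=1}^{p}\|\mathbf{x}_i\|^2 > \Delta\sum_{i=1}^{p}\|\mathbf{x}_i\|^2 \iff \sum_{i=1}^{p}\Big[\underbrace{\Delta + \sum_{j \in I}\lambda_j - \mathrm{Ind}(i \in I)}_{a_i}\Big]\|\mathbf{x}_i\|^2 < 0$$

We have:

$$a = \sum_{i=1}^{p}\lambda_i a_i = \Delta \cdot \sum_{i=1}^{p}\lambda_i + \sum_{i \in I}\lambda_i \cdot \sum_{i=1}^{p}\lambda_i - \sum_{i=1}^{p}\lambda_i\,\mathrm{Ind}(i \in I) = \Delta \tag{102}$$

And $|a_i| \le 1 + \Delta \equiv A$, therefore for $\Delta \le 1/7$ we have
$a \le \frac{1}{8}A$ and by Lemma 7

$$\Pr\left\{\sum_{i \in I}\left(\frac{\|\mathbf{x}_i\|^2}{\|\mathbf{x}\|^2} - \lambda_i\right) > \Delta\right\} \le e^{-nE} \le e^{-nE_0} \tag{103}$$

where

$$E = \frac{a^2}{6A^2} = \frac{\Delta^2}{6(1+\Delta)^2} \ge \frac{\Delta^2}{6(1+1/7)^2} \ge \frac{\Delta^2}{8} \equiv E_0 \tag{104}$$

and from Eq. (99) we have

$$\Pr\{J_\Delta\} \le \sum_{I \in V} \Pr\left\{\sum_{i \in I}\left(\frac{\|\mathbf{x}_i\|^2}{\|\mathbf{x}\|^2} - \lambda_i\right) > \Delta\right\} \le |V| \cdot e^{-nE_0} \le 2^p e^{-nE_0} \tag{105}$$

which proves the lemma. Note that different bounds can
be obtained by applying the bound on $m$ smaller sets
and requiring that the sum over each set will
be bounded by $\Delta/m$ (as an example we could bound each
$\max(z_i, 0)$ separately by $\Delta/p$); however, this bound is most
suitable for our purpose since when $p \ll n$ the element $2^p$
becomes negligible. □

TABLE III
Parameters of adaptive rate scheme used for Figure 3

| Item | Reference | Parameter set 1 (of Figure 3) | Parameter set 2 |
|---|---|---|---|
| Transmission scheme | Section V-C | $n$ = 1e+008, $K$ = 1e+006, $P_A$ = 0.001, $P_e$ = 0.001 | $n$ = 1e+020, $K$ = 1e+017, $P_A$ = 0.001, $P_e$ = 0.001 |
| $R_{LB1}$ parameters | Section VI-C2, Eq. (65) | $T$ = 2.5e+005, $A_M$ = 37.5412, $\Delta$ = 0.0345958, $\eta_1$ = 0.996007, $\eta_2$ = 0.999962, $\epsilon_1$ = 0.01 | $T$ = 7.5e+015, $A_M$ = 77.4043, $\Delta$ = 3.14616e-007, $\eta_1$ = 1, $\eta_2$ = 1, $\epsilon_1$ = 0.001 |
| $R_{LB2}$ parameters | Section VI-C2, Theorem 4 | $\rho$ = 0.9, $\epsilon$ = 0.139438, $R$ = 1.05173 | $\rho$ = 0.99998, $\epsilon$ = 0.0068209, $R$ = 7.29818 |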
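Proof of Lemma 7: We assume without loss of generality
that $\mathbf{x} \sim \mathcal{N}(0,1)^n$. For a Gaussian r.v. $X \sim \mathcal{N}(0,1)$ and $a < \frac{1}{2}$
we have:

$$E\left[e^{aX^2}\right] = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}} e^{ax^2}\,dx = \frac{1}{\sqrt{1-2a}} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi(1-2a)^{-1}}}\, e^{-\frac{x^2}{2(1-2a)^{-1}}}\,dx = \frac{1}{\sqrt{1-2a}} \tag{106}$$

For coefficients $\{a_i\}_{i=1}^p$ with $\sum_i \lambda_i a_i = a > 0$ and $|a_i| \le A$,
$w > 0$ a positive constant of our choice, and $\mathbf{x} \sim \mathcal{N}(0,1)^n$
we have:

$$\begin{aligned}
\ln\Pr\left(\sum_i a_i\|\mathbf{x}_i\|^2 \le 0\right) &\le \ln E\, e^{-\frac{w}{2}\sum_i a_i\|\mathbf{x}_i\|^2} = \sum_i \sum_{j \in A_i} \ln E\, e^{-\frac{w}{2} a_i x_j^2} = -\frac{1}{2}\sum_i n_i \ln(1 + w a_i) \\
&\overset{(a)}{=} -\frac{n}{2}\sum_i \lambda_i\left[w a_i - \frac{w^2 a_i^2}{2(1 + w t_i)^2}\right] \overset{(b)}{\le} -\frac{n}{2}\left(aw - \frac{A^2 w^2}{2(1 - wA)^2}\right)
\end{aligned} \tag{107}$$

where (a) is based on the second order Taylor series of $\ln(1 + wt)$
around $t = 0$ with some $t_i \in [0, a_i] \cup [a_i, 0]$, and (b) is since
$|t_i| \le |a_i| \le A$. For simplicity we choose a sub-optimal $w^* = \frac{a}{A^2}$
(which is obtained by assuming small $a, w$ and optimizing
the bound with respect to $w$ ignoring the denominator) and
obtain:

$$aw^* - \frac{A^2 w^{*2}}{2(1 - w^* \cdot A)^2} = \frac{a^2}{A^2} - \frac{a^2}{A^2} \cdot \frac{1}{2(1 - a/A)^2} = \frac{a^2}{A^2}\left(1 - \frac{A^2}{2(A-a)^2}\right) \tag{108}$$

To simplify the bound, we make a further assumption that
$|a| \le \frac{1}{8}A$, therefore:

$$\frac{a^2}{A^2}\left(1 - \frac{A^2}{2(A-a)^2}\right) \ge \frac{a^2}{A^2}\left(1 - \frac{A^2}{2 \cdot (7/8)^2 \cdot A^2}\right) \ge \frac{1}{3} \cdot \frac{a^2}{A^2} \tag{109}$$

Therefore we can write the following bound: for $|a| \le \frac{1}{8}A$ we
have

$$\Pr\left(\sum_i a_i\|\mathbf{x}_i\|^2 \le 0\right) \le e^{-nE} \tag{110}$$

where $E = \frac{a^2}{6A^2}$. Note that the bound is true for any $\mathbf{x} \sim \mathcal{N}(0,P)^n$. □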
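
The bound of Lemma 7, read here as $\Pr\left(\sum_i a_i\|\mathbf{x}_i\|^2 \le 0\right) \le e^{-nE}$ with $E = a^2/(6A^2)$, $a = \sum_i \lambda_i a_i > 0$, $|a_i| \le A$ and $a \le A/8$, can be compared with a simulation. The block sizes, coefficients and number of trials below are arbitrary choices made for this sketch, and $A$ is taken as the largest $|a_i|$.

```python
import math
import random

def lemma7_bound(n, lambdas, coeffs):
    """exp(-n * a^2 / (6 A^2)) with a = sum_i lambda_i * a_i and A = max_i |a_i|."""
    a = sum(l * c for l, c in zip(lambdas, coeffs))
    big_a = max(abs(c) for c in coeffs)
    if not (0 < a <= big_a / 8):
        raise ValueError("the lemma requires 0 < a <= A/8")
    return math.exp(-n * a * a / (6 * big_a * big_a))

def simulate(n, lambdas, coeffs, trials, rng):
    """Monte Carlo estimate of Pr( sum_i a_i * |x_i|^2 <= 0 ) for x ~ N(0,1)^n."""
    sizes = [round(l * n) for l in lambdas]      # block lengths n_i = lambda_i * n
    hits = 0
    for _ in range(trials):
        total = 0.0
        for size, c in zip(sizes, coeffs):
            total += c * sum(rng.gauss(0, 1) ** 2 for _ in range(size))
        hits += total <= 0
    return hits / trials

rng = random.Random(0)
n, lambdas, coeffs = 200, (0.5, 0.5), (1.0, -0.75)   # a = 0.125, A = 1
estimate = simulate(n, lambdas, coeffs, 2000, rng)
bound = lemma7_bound(n, lambdas, coeffs)
print(estimate, bound)
```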

F. Parameters of adaptive rate scheme used for Figure 3

Table III lists two sets of parameters for the continuous
alphabet adaptive rate scheme. The first set was used for the
curves in Figure 3 and the second set shows the convergence of
$\epsilon, R$ for higher values of $n, K$. Note that the values of $n, K$
are extremely high, and this is due to the looseness of the
bounds used in the continuous case: specifically the exponent
of Lemma 6, which yields a relatively slow convergence of the
ill-convexity probability in equation 56